# **Combining a model with a model trained without outliers**

This notebook assumes that you have already finished your feature engineering and have two datasets:

- ***train_clean.csv***
- ***test_clean.csv***

In train_clean.csv, there is an **'outliers' column with values 1/0**.

You also need your best leaderboard (LB) submission:
- ***3.695.csv*** (thanks to **Ashish Patel**. My original model can't reach this score, so I use the idea below to improve on that submission and get a better LB score.)

The flow of this pipeline is as follows:
1. Train a model on the training set without outliers (we get **Model_1**).
2. Train a model to classify outliers (we get **Model_2**).
3. Use **Model_2** to predict whether each card_id in the test set is an outlier (we get **Outlier_Likelihood**).
4. Select the card_ids with the top 10% (or some other ratio) of **Outlier_Likelihood** scores (we get **Outlier_ID**).
5. Build the final submission: use your **best submission (that is, your best model)** for the **Outlier_ID** rows of the test set and **Model_1** for the rest of the test set.
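
Steps 4 and 5 can be sketched on toy data as follows. This is a minimal illustration, not the notebook's actual code: the helper name `blend_by_outlier_score`, its parameters, and the toy values are assumptions made for the example, and each frame is assumed to hold one row per `card_id`.

```python
import pandas as pd

def blend_by_outlier_score(no_outlier_pred, best_pred, outlier_prob, top_frac=0.10):
    """Use best_pred for the top_frac most outlier-like card_ids, no_outlier_pred elsewhere.

    Hypothetical helper: all three frames are assumed to have the columns
    'card_id' and 'target', with unique card_ids.
    """
    n_top = int(round(len(outlier_prob) * top_frac))
    # Step 4: card_ids with the highest predicted outlier probability.
    outlier_ids = set(outlier_prob.nlargest(n_top, "target")["card_id"])
    # Step 5: align the best submission to the Model_1 row order, then swap values in.
    best_aligned = no_outlier_pred[["card_id"]].merge(best_pred, on="card_id", how="left")
    use_best = no_outlier_pred["card_id"].isin(outlier_ids).to_numpy()
    blended = no_outlier_pred.copy()
    blended.loc[use_best, "target"] = best_aligned.loc[use_best, "target"].to_numpy()
    return blended

# Toy data: Model_1 predicts 0.0 everywhere, the best submission predicts i for C_ID_i.
ids = [f"C_ID_{i}" for i in range(10)]
model_1 = pd.DataFrame({"card_id": ids, "target": [0.0] * 10})
best = pd.DataFrame({"card_id": ids[::-1], "target": [float(i) for i in range(9, -1, -1)]})
prob = pd.DataFrame({"card_id": ids,
                     "target": [0.01, 0.02, 0.01, 0.90, 0.03, 0.01, 0.02, 0.01, 0.02, 0.01]})

blended = blend_by_outlier_score(model_1, best, prob, top_frac=0.10)
print(blended)  # only C_ID_3 takes its value from the best submission
```

In the notebook's terms, `no_outlier_pred` would play the role of `model_without_outliers`, `best_pred` the best submission, and `outlier_prob` the frame `df_outlier_prob`.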

The basic idea behind this pipeline is:
1. Training a model without outliers makes it more accurate for non-outliers.
2. A large proportion of the error is caused by outliers, so we need a model trained with outliers to predict them. How do we find them? Build a classifier!
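
To get a feel for the second point, here is a small sketch with synthetic numbers (not the competition data). The spread of the ordinary targets (1.7) and the outlier value (-30) are assumptions chosen only for illustration; the 1.06% outlier rate is the one quoted later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n, outlier_rate = 100_000, 0.0106
is_outlier = rng.random(n) < outlier_rate

# Assumption: ordinary targets are roughly N(0, 1.7); outliers sit far away at -30.
y = np.where(is_outlier, -30.0, rng.normal(0.0, 1.7, n))
pred = np.zeros(n)  # a naive model that never predicts an outlier

sq_err = (y - pred) ** 2
share = sq_err[is_outlier].sum() / sq_err.sum()
print(f"outliers: {is_outlier.mean():.2%} of rows, {share:.1%} of the squared error")
```

Under these assumptions, about one percent of the rows accounts for most of the squared error, which is why the outlier rows deserve separate treatment.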

In [0]:
import numpy as np
import pandas as pd
import time
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import mean_squared_error
from sklearn.metrics import log_loss

## **I. Training Model Without Outliers**

In [2]:
%%time
df_train = pd.read_csv('/content/train_clean.csv')
df_test = pd.read_csv('/content/test_clean.csv')

CPU times: user 3.94 s, sys: 276 ms, total: 4.21 s
Wall time: 4.22 s


### Filtering out outliers

In [0]:
df_train = df_train[df_train['outliers'] == 0]
target = df_train['target']
del df_train['target']
features = [c for c in df_train.columns if c not in ['card_id', 'first_active_month','outliers']]
categorical_feats = [c for c in features if 'feature_' in c]

### Parameters

In [0]:
param = {'objective':'regression',
         'num_leaves': 31,
         'min_data_in_leaf': 25,
         'max_depth': 7,
         'learning_rate': 0.01,
         'lambda_l1':0.13,
         "boosting": "gbdt",
         "feature_fraction":0.85,
         'bagging_freq':8,
         "bagging_fraction": 0.9 ,
         "metric": 'rmse',
         "verbosity": -1,
         "random_state": 2333}

### Training model

In [5]:
%%time
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=2333)
oof = np.zeros(len(df_train))
predictions = np.zeros(len(df_test))
feature_importance_df = pd.DataFrame()

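# After filtering, 'outliers' is all zeros, so this stratified split should behave like a plain shuffled K-fold.
for fold_, (trn_idx, val_idx) in enumerate(folds.split(df_train,df_train['outliers'].values)):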
    print("fold {}".format(fold_))
    trn_data = lgb.Dataset(df_train.iloc[trn_idx][features], label=target.iloc[trn_idx])
    val_data = lgb.Dataset(df_train.iloc[val_idx][features], label=target.iloc[val_idx])

    num_round = 10000
    clf = lgb.train(param, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval= 100, early_stopping_rounds = 200)
    oof[val_idx] = clf.predict(df_train.iloc[val_idx][features], num_iteration=clf.best_iteration)
    
    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    predictions += clf.predict(df_test[features], num_iteration=clf.best_iteration) / folds.n_splits

print("CV score: {:<8.5f}".format(mean_squared_error(oof, target)**0.5))

fold 0
Training until validation scores don't improve for 200 rounds.
[100]	training's rmse: 1.60848	valid_1's rmse: 1.6135
[200]	training's rmse: 1.57816	valid_1's rmse: 1.58471
[300]	training's rmse: 1.56375	valid_1's rmse: 1.57287
[400]	training's rmse: 1.5544	valid_1's rmse: 1.56657
[500]	training's rmse: 1.54735	valid_1's rmse: 1.56276
[600]	training's rmse: 1.54165	valid_1's rmse: 1.56036
[700]	training's rmse: 1.53677	valid_1's rmse: 1.55887
[800]	training's rmse: 1.53226	valid_1's rmse: 1.55788
[900]	training's rmse: 1.52817	valid_1's rmse: 1.55719
[1000]	training's rmse: 1.52434	valid_1's rmse: 1.55676
[1100]	training's rmse: 1.52081	valid_1's rmse: 1.55641
[1200]	training's rmse: 1.51732	valid_1's rmse: 1.55626
[1300]	training's rmse: 1.51413	valid_1's rmse: 1.55605
[1400]	training's rmse: 1.51097	valid_1's rmse: 1.5559
[1500]	training's rmse: 1.50777	valid_1's rmse: 1.55583
[1600]	training's rmse: 1.50459	valid_1's rmse: 1.55568
[1700]	training's rmse: 1.50144	valid_1's rmse

In [0]:
model_without_outliers = pd.DataFrame({"card_id":df_test["card_id"].values})
model_without_outliers["target"] = predictions

## **II. Training Model For Outliers Classification**

In [7]:
%%time
df_train = pd.read_csv('/content/train_clean.csv')
df_test = pd.read_csv('/content/test_clean.csv')

CPU times: user 3.83 s, sys: 64.9 ms, total: 3.89 s
Wall time: 3.9 s


### Using outliers column as labels instead of target column

In [0]:
target = df_train['outliers']
del df_train['outliers']
del df_train['target']

In [0]:
features = [c for c in df_train.columns if c not in ['card_id', 'first_active_month']]
categorical_feats = [c for c in features if 'feature_' in c]

### Parameters

In [0]:
param = {'num_leaves': 31,
         'min_data_in_leaf': 30, 
         'objective':'binary',
         'max_depth': 6,
         'learning_rate': 0.01,
         "boosting": "rf",
         "feature_fraction": 0.9,
         "bagging_freq": 1,
         "bagging_fraction": 0.9 ,
         "bagging_seed": 11,
         "metric": 'binary_logloss',
         "lambda_l1": 0.1,
         "verbosity": -1,
         "random_state": 2333}

### Training model

In [11]:
%%time
folds = KFold(n_splits=5, shuffle=True, random_state=15)
oof = np.zeros(len(df_train))
predictions = np.zeros(len(df_test))
feature_importance_df = pd.DataFrame()

start = time.time()


for fold_, (trn_idx, val_idx) in enumerate(folds.split(df_train.values, target.values)):
    print("fold n°{}".format(fold_))
    trn_data = lgb.Dataset(df_train.iloc[trn_idx][features], label=target.iloc[trn_idx], categorical_feature=categorical_feats)
    val_data = lgb.Dataset(df_train.iloc[val_idx][features], label=target.iloc[val_idx], categorical_feature=categorical_feats)

    num_round = 10000
    clf = lgb.train(param, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=100, early_stopping_rounds = 200)
    oof[val_idx] = clf.predict(df_train.iloc[val_idx][features], num_iteration=clf.best_iteration)
    
    fold_importance_df = pd.DataFrame()
    fold_importance_df["feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    predictions += clf.predict(df_test[features], num_iteration=clf.best_iteration) / folds.n_splits

print("CV score: {:<8.5f}".format(log_loss(target, oof)))

fold n°0
Training until validation scores don't improve for 200 rounds.
[100]	training's binary_logloss: 0.17393	valid_1's binary_logloss: 0.184568
[200]	training's binary_logloss: 0.171317	valid_1's binary_logloss: 0.182166
[300]	training's binary_logloss: 0.169496	valid_1's binary_logloss: 0.180408
[400]	training's binary_logloss: 0.169496	valid_1's binary_logloss: 0.180203
[500]	training's binary_logloss: 0.170259	valid_1's binary_logloss: 0.180901
Early stopping, best iteration is:
[362]	training's binary_logloss: 0.169132	valid_1's binary_logloss: 0.179841
fold n°1
Training until validation scores don't improve for 200 rounds.
[100]	training's binary_logloss: 0.184874	valid_1's binary_logloss: 0.175541
[200]	training's binary_logloss: 0.183896	valid_1's binary_logloss: 0.174741
[300]	training's binary_logloss: 0.180105	valid_1's binary_logloss: 0.171359
[400]	training's binary_logloss: 0.180279	valid_1's binary_logloss: 0.171417
[500]	training's binary_logloss: 0.181017	valid_1's binary_lo
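
One way to read the log-loss values logged above is to compare them with a constant-prediction baseline. The sketch below assumes the 1.06% outlier rate quoted later in this notebook; it is a reference point, not part of the original pipeline.

```python
import math

p = 0.0106  # outlier rate quoted in this notebook for the training set

# Log loss obtained by predicting the constant probability p for every row,
# when a fraction p of the rows are positive.
baseline = -(p * math.log(p) + (1 - p) * math.log(1 - p))
print(f"constant-prediction log loss: {baseline:.4f}")
```

A logged value well above this baseline would suggest that the predicted probabilities are poorly calibrated. Since the combining step only uses the ranking of the scores, a rank-based metric such as `sklearn.metrics.roc_auc_score` on the out-of-fold predictions may be a more informative check.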

In [12]:
# 'target' holds the predicted probability that an observation is an outlier
df_outlier_prob = pd.DataFrame({"card_id":df_test["card_id"].values})
df_outlier_prob["target"] = predictions
df_outlier_prob.head()

| | card_id | target |
|---|---|---|
| 0 | C_ID_0ab67a22ab | 0.191482 |
| 1 | C_ID_130fd0cbdd | 0.004683 |
| 2 | C_ID_b709037bc5 | 0.007254 |
| 3 | C_ID_d27d835a9f | 0.004683 |
| 4 | C_ID_2b5e3df5c2 | 0.004683 |


## **III. Combining Submissions**
So far so good!
We now have three datasets:

1. Best Submission
2. Prediction Using Model Without Outliers
3. Probability of Outliers in Test Set


If the test set has the same ratio of outliers as the training set (1.06%), then the number of outliers in the test set is about:
`123623 * 0.0106 ≈ 1310`
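
The arithmetic can be checked directly, together with the share of the test set covered by the 25000 card_ids selected in the next cell. The variable names are illustrative; the numbers come from this notebook.

```python
n_test = 123623        # number of rows in the test set, from the text above
outlier_rate = 0.0106  # share of outliers in the training set, from the text above
n_selected = 25000     # number of card_ids selected in the next cell

expected_outliers = n_test * outlier_rate
print(round(expected_outliers))      # 1310
print(f"{n_selected / n_test:.1%}")  # 20.2%
```

Selecting 25000 rows therefore casts a much wider net than the expected number of outliers, and also more than the 10% mentioned in the pipeline overview.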

In [0]:
# To avoid missing predictable outliers, we choose the 25000 card_ids (about 20% of the test set) with the highest outlier likelihood.
outlier_id = pd.DataFrame(df_outlier_prob.sort_values(by='target',ascending = False).head(25000)['card_id'])

In [0]:
best_submission = pd.read_csv('/content/3.695.csv')

In [16]:
most_likely_liers = best_submission.merge(outlier_id,how='right')
most_likely_liers.head()

| | card_id | target |
|---|---|---|
| 0 | C_ID_0ab67a22ab | -2.502326 |
| 1 | C_ID_6d8dba8475 | -0.893964 |
| 2 | C_ID_7f1041e8e1 | -4.872942 |
| 3 | C_ID_22e4a47c72 | 0.393946 |
| 4 | C_ID_b54cfad8b2 | -0.656266 |


In [17]:
%%time
for card_id in most_likely_liers['card_id']:
    model_without_outliers.loc[model_without_outliers['card_id']==card_id,'target']\
    = most_likely_liers.loc[most_likely_liers['card_id']==card_id,'target'].values

CPU times: user 3min 56s, sys: 563 ms, total: 3min 56s
Wall time: 3min 56s
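
The row-by-row loop above takes about four minutes. A vectorized alternative is sketched below on toy frames. It assumes `card_id` is unique in both frames and that the best submission contains no missing targets; the toy values are made up for the example. It is meant to give the same result as the loop, not to stand in for the author's code.

```python
import pandas as pd

# Toy stand-ins for the notebook's frames (values are illustrative).
model_without_outliers = pd.DataFrame(
    {"card_id": ["a", "b", "c", "d"], "target": [0.1, 0.2, 0.3, 0.4]}
)
most_likely_liers = pd.DataFrame({"card_id": ["c", "a"], "target": [-3.0, -1.0]})

# Map each selected card_id to its target in the best submission (NaN if not selected).
replacement = most_likely_liers.set_index("card_id")["target"]
mapped = model_without_outliers["card_id"].map(replacement)

# Keep the best-submission value where one exists, otherwise the Model_1 value.
model_without_outliers["target"] = mapped.fillna(model_without_outliers["target"])
print(model_without_outliers)
```

Because the lookup is done once for the whole column rather than once per `card_id`, this should run in well under a second on frames of the size used here.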


In [0]:
model_without_outliers.to_csv("model_no_outlier.csv", index=False)
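
Before uploading, it can be worth running a few basic checks on the combined frame. The helper below is a hypothetical sketch, not part of the original notebook; it assumes the submission has exactly the columns `card_id` and `target`.

```python
import pandas as pd

def check_submission(sub, expected_ids):
    """Basic checks on a submission frame (hypothetical helper)."""
    assert list(sub.columns) == ["card_id", "target"], "unexpected columns"
    assert sub["card_id"].is_unique, "duplicate card_id"
    assert set(sub["card_id"]) == set(expected_ids), "card_id mismatch with test set"
    assert sub["target"].notna().all(), "missing predictions"
    return True

# Toy usage with made-up values.
toy_ids = ["a", "b", "c"]
toy_sub = pd.DataFrame({"card_id": toy_ids, "target": [0.1, -0.2, 0.3]})
ok = check_submission(toy_sub, toy_ids)
print(ok)
```

In this notebook the call would look like `check_submission(model_without_outliers, df_test["card_id"])`.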