# TalkingData-XGBoost

created by chenlu

The previous notebook, *EDA.ipynb*, presents some analysis of the raw data.

This notebook builds a complete pipeline, from preprocessing the raw data to writing the final submission.

## [Phase 1 : Feature Engineering](#phase1)
1. Sampling data
2. Preprocessing sampled data
3. Generating features, save feature matrix
4. Evaluate feature importance within feature groups
5. Select important features
6. Separate train/dev/test set

## [Phase 2 : Training Model](#phase2)
1. Preprocessing full data
2. Generating features selected from the previous phase
3. Separate train/dev/test set
4. Negative down-sampling (positive : negative = 1 : 1)(3 down-sampled datasets)
5. Experiments on the effect of down-sampling
6. Training 3 XGBoost models on 3 sampled datasets
7. Tuning the model

## [Phase 3 : Predict & Output](#phase3)
1. Predict test set using 3 well-trained models
2. Generate 3 submission files

Note:

    train.csv -> 184,903,891 rows
    
    test.csv -> 18,790,470 rows

<a id='phase1'></a>
## Phase 1 : Feature Engineering

<a id='sampling_data'>1.Sampling data</a>

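Because the raw data is large and heavily imbalanced, we first keep every positive case and sample a fraction of the negative cases from the training data, separately for Days 7, 8 and 9 so that the proportion between days is preserved. The fraction is set by `frac` in the code below.

Days 7 and 8 form the train set, and Day 9 is the dev set.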
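
The per-day sampling described above can be sketched as a small helper. This is only an illustration on toy data: the function name `sample_negatives_by_day`, the `seed` argument and the toy row counts are assumptions, not part of the original pipeline.

```python
import pandas as pd

def sample_negatives_by_day(df, days, frac, seed=0):
    """Keep every positive row and a `frac` share of the negatives of each day."""
    parts = []
    for day in days:
        day_rows = df[df["day"] == day]
        positives = day_rows[day_rows["is_attributed"] == 1]
        negatives = day_rows[day_rows["is_attributed"] == 0]
        parts.append(negatives.sample(frac=frac, random_state=seed))
        parts.append(positives)
    return pd.concat(parts).reset_index(drop=True)

# Toy data (hypothetical): 200 rows per day, 20 of them positive.
toy = pd.DataFrame({
    "day": [7] * 200 + [8] * 200 + [9] * 200,
    "is_attributed": ([1] * 20 + [0] * 180) * 3,
})
sampled = sample_negatives_by_day(toy, days=[7, 8, 9], frac=0.05)
print(sampled.groupby(["day", "is_attributed"]).size())
```

Sampling each day separately keeps the relative size of the days unchanged, which a single global `sample` call would only do approximately.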

In [None]:
%load_ext autoreload
%autoreload 2
import gc
import time
import numpy as np
import pandas as pd
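from sklearn import preprocessing
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split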
import xgboost as xgb
from xgboost import plot_importance
import matplotlib.pyplot as plt
import pickle
from generate_features import *
from model import *

path = '../input/'

train_columns = ['ip', 'app', 'device', 'os', 'channel', 'click_time', 'is_attributed']
test_columns  = ['ip', 'app', 'device', 'os', 'channel', 'click_time', 'click_id']
dtypes = {
        'ip'            : 'uint32',
        'app'           : 'uint16',
        'device'        : 'uint16',
        'os'            : 'uint16',
        'channel'       : 'uint16',
        'is_attributed' : 'uint8',
        'click_id'      : 'uint32',
        }

In [None]:
start_time = time.time()
# Load the full train and test sets (only the listed columns, with compact dtypes)
train = pd.read_csv(f"{path}train.csv", usecols=train_columns, dtype=dtypes, parse_dates=['click_time'])
test = pd.read_csv(f"{path}test.csv", usecols=test_columns, dtype=dtypes, parse_dates=['click_time'])
print(f'[{time.time() - start_time}] Finished loading data')
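
The next cell relies on `time_features` from `generate_features`, which is not shown in this notebook. Since later cells use the `day`, `hour` and `minute` columns, a plausible minimal version might look like the sketch below. The name `time_features_sketch` and the `uint8` dtypes are assumptions.

```python
import pandas as pd

def time_features_sketch(df):
    """Add day, hour and minute columns derived from click_time."""
    df = df.copy()
    df["day"] = df["click_time"].dt.day.astype("uint8")
    df["hour"] = df["click_time"].dt.hour.astype("uint8")
    df["minute"] = df["click_time"].dt.minute.astype("uint8")
    return df

toy = pd.DataFrame({
    "click_time": pd.to_datetime(["2017-11-07 09:30:38", "2017-11-09 23:05:00"])
})
toy = time_features_sketch(toy)
print(toy)
```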

In [None]:
train = time_features(train)
gc.collect()
test = time_features(test)
gc.collect()

frac = 0.05
train_7 = pd.concat([train[(train['is_attributed']==0) & (train['day']==7)].sample(frac=frac),\
                     train[(train['is_attributed']==1) & (train['day']==7)]]).reset_index(drop=True)
train_8 = pd.concat([train[(train['is_attributed']==0) & (train['day']==8)].sample(frac=frac),\
                     train[(train['is_attributed']==1) & (train['day']==8)]]).reset_index(drop=True)
train_9 = pd.concat([train[(train['is_attributed']==0) & (train['day']==9)].sample(frac=frac),\
                     train[(train['is_attributed']==1) & (train['day']==9)]]).reset_index(drop=True)

X_train = pd.concat([train_7,train_8])
del train, train_7,train_8
gc.collect()

# sampled train data
train = pd.concat([X_train,train_9])
del X_train, train_9
gc.collect()

# X_total = train + test
test['is_attributed'] = np.nan
X_total = pd.concat([train,test.drop(['click_id'], axis=1)])
del train, test
gc.collect()

In [None]:
X_total.to_pickle('intermediate/X_total.pkl.gz')

In [None]:
X_total.head()

<a id='generating_features'>
3.Generating features
</a>

If you have already run the previous cells, you can start here.

For each feature group, the cells below train an XGBoost model and save the feature matrix.

In [None]:
# X_total = pd.read_pickle('X_total.pkl.gz')
X_total = pd.read_pickle('intermediate/X_total.pkl.gz')
plot_df = pd.DataFrame()

baseline

In [None]:
X_total_ = X_total
clf, evals_result = xgb_train(X_total_)
plot_df['base'] = evals_result['validation_0']['auc']
gc.collect()

In [None]:
feature_name = "base"
modelname = f"intermediate/{feature_name}_clf.pkl.gz"
feat_mat = f"intermediate/{feature_name}.pkl.gz"
pickle.dump(clf, open(modelname, 'wb'))
X_total_.to_pickle(feat_mat)

clicks by ip
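
`clicks_by_ip` is defined in `generate_features` and not shown here. As a rough idea of what such a feature could be, the sketch below counts the clicks of each ip and attaches the count to every row. The function name and the output column name `clicks_by_ip` are assumptions.

```python
import pandas as pd

def clicks_by_ip_sketch(df):
    """Attach the total number of clicks made by each row's ip."""
    df = df.copy()
    df["clicks_by_ip"] = df.groupby("ip")["channel"].transform("count").astype("uint32")
    return df

toy = pd.DataFrame({"ip": [1, 1, 2, 1, 3], "channel": [10, 11, 10, 12, 10]})
toy = clicks_by_ip_sketch(toy)
print(toy)
```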

In [None]:
X_total_ = clicks_by_ip(X_total)
clf, evals_result = xgb_train(X_total_)
plot_df['clicks_by_ip'] = evals_result['validation_0']['auc']
gc.collect()

In [None]:
feature_name = "clicks_by_ip"
modelname = f"intermediate/{feature_name}_clf.pkl.gz"
feat_mat = f"intermediate/{feature_name}.pkl.gz"
pickle.dump(clf, open(modelname, 'wb'))
X_total_.to_pickle(feat_mat)

confidence rate feature
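
`confidence_rate_feature` is not shown in this notebook. One common way to build such a feature, given here only as an assumption about what it might do, is to multiply the attribution rate of a group by a weight that grows with the number of clicks in that group, estimated on the labeled rows only. The helper name `add_conf_rate` and the constant `log_cap` are hypothetical.

```python
import numpy as np
import pandas as pd

def add_conf_rate(total, train, cols, log_cap=100000):
    """Attach a confidence-weighted attribution rate for the `cols` group.

    Rates are estimated on `train` only, so unlabeled rows never
    contribute to their own feature. `log_cap` is a hypothetical constant.
    """
    stats = train.groupby(cols)["is_attributed"].agg(["mean", "count"]).reset_index()
    # Groups with few clicks get a weight below 1; the weight saturates at 1.
    weight = np.minimum(1.0, np.log(stats["count"]) / np.log(log_cap))
    name = "_".join(cols) + "_confRate"
    stats[name] = stats["mean"] * weight
    return total.merge(stats[cols + [name]], on=cols, how="left")

train = pd.DataFrame({"app": [1, 1, 1, 1, 2], "is_attributed": [1, 0, 1, 0, 1]})
test = pd.DataFrame({"app": [1, 3], "is_attributed": [np.nan, np.nan]})
total = pd.concat([train, test], ignore_index=True)
out = add_conf_rate(total, train, ["app"], log_cap=4)
print(out)
```

In the toy run, app 2 has a single click, so its weight is zero even though its raw rate is 1.0, and app 3 never appears in the labeled rows, so its feature is left as NaN.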

In [None]:
X_train = X_total[X_total['day'] != 10]
X_total_ = confidence_rate_feature(X_total, X_train)
clf, evals_result = xgb_train(X_total_)
plot_df['confidence_rate_feature'] = evals_result['validation_0']['auc']
gc.collect()

In [None]:
feature_name = "confidence_rate_feature"
modelname = f"intermediate/{feature_name}_clf.pkl.gz"
feat_mat = f"intermediate/{feature_name}.pkl.gz"
pickle.dump(clf, open(modelname, 'wb'))
X_total_.to_pickle(feat_mat)

group by feature
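
`group_by_feature` is not shown here, but the column names selected later (for example `app_count_channel`, `ip_nunique_app`, `ip_cumcount_app`) suggest count, nunique and cumcount aggregations. The sketch below shows how such columns could be computed on toy data; the exact meaning of each name is an assumption.

```python
import pandas as pd

toy = pd.DataFrame({
    "ip": [1, 1, 1, 2, 2],
    "app": [3, 3, 4, 3, 3],
    "channel": [10, 11, 10, 10, 10],
})
# Total number of clicks per app (counted over the channel column).
toy["app_count_channel"] = toy.groupby("app")["channel"].transform("count")
# Number of distinct apps each ip has clicked.
toy["ip_nunique_app"] = toy.groupby("ip")["app"].transform("nunique")
# Running index of each click within its ip, in row order.
toy["ip_cumcount_app"] = toy.groupby("ip").cumcount()
print(toy)
```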

In [None]:
X_total_ = group_by_feature(X_total)
clf, evals_result = xgb_train(X_total_)
plot_df['group_by_feature'] = evals_result['validation_0']['auc']
gc.collect()

In [None]:
feature_name = "group_by_feature"
modelname = f"intermediate/{feature_name}_clf.pkl.gz"
feat_mat = f"intermediate/{feature_name}.pkl.gz"
pickle.dump(clf, open(modelname, 'wb'))
X_total_.to_pickle(feat_mat)

next click feature (very slow)
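
`next_click_feature` is not shown here. A typical "next click" feature is the time in seconds until the same group (for example the same ip) clicks again. The sketch below illustrates that idea; the helper name is an assumption, and the output name follows the `ip_nextClick` pattern used later in this notebook.

```python
import pandas as pd

def next_click_sketch(df, cols):
    """Seconds until the next click of the same `cols` group (NaN for the last one)."""
    df = df.sort_values("click_time").copy()
    next_time = df.groupby(cols)["click_time"].shift(-1)
    name = "_".join(cols) + "_nextClick"
    df[name] = (next_time - df["click_time"]).dt.total_seconds()
    return df.sort_index()  # restore the original row order

toy = pd.DataFrame({
    "ip": [1, 2, 1, 1],
    "click_time": pd.to_datetime([
        "2017-11-07 00:00:00", "2017-11-07 00:00:05",
        "2017-11-07 00:00:10", "2017-11-07 00:01:00",
    ]),
})
toy = next_click_sketch(toy, ["ip"])
print(toy)
```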

In [None]:
X_total_ = next_click_feature(X_total)
clf, evals_result = xgb_train(X_total_)
plot_df['next_click_feature'] = evals_result['validation_0']['auc']
gc.collect()

In [None]:
feature_name = "next_click_feature"
modelname = f"intermediate/{feature_name}_clf.pkl.gz"
feat_mat = f"intermediate/{feature_name}.pkl.gz"
pickle.dump(clf, open(modelname, 'wb'))
X_total_.to_pickle(feat_mat)

history click feature
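
`history_click_feature` is not shown here. The selected column `future_app_clicks` suggests counting how many clicks of the same group come before or after the current row. A sketch under that assumption, with rows assumed to be sorted by click time and `prev_app_clicks` as a hypothetical column name:

```python
import pandas as pd

toy = pd.DataFrame({"ip": [1, 1, 2, 1], "app": [3, 3, 3, 4]})
# Clicks by the same ip on the same app that occur earlier in row order.
toy["prev_app_clicks"] = toy.groupby(["ip", "app"]).cumcount()
# Same count looking forward: reverse the rows, count, then restore the order.
toy["future_app_clicks"] = (
    toy.iloc[::-1].groupby(["ip", "app"]).cumcount().iloc[::-1]
)
print(toy)
```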

In [None]:
X_total_ = history_click_feature(X_total)
clf, evals_result = xgb_train(X_total_)
plot_df['history_click_feature'] = evals_result['validation_0']['auc']
gc.collect() 

In [None]:
feature_name = "history_click_feature"
modelname = f"intermediate/{feature_name}_clf.pkl.gz"
feat_mat = f"intermediate/{feature_name}.pkl.gz"
pickle.dump(clf, open(modelname, 'wb'))
X_total_.to_pickle(feat_mat)

topic feature (very slow)

In [None]:
# X_total_ = topic_feature(X_total)
clf, evals_result = easy_train(X_total)
plot_df['topic_feature'] = evals_result['validation_0']['auc']
gc.collect()

feature_name = "topic_feature"
modelname = f"model/{feature_name}_clf.pkl.gz"
feat_mat = f"feature/{feature_name}.pkl.gz"
pickle.dump(clf, open(modelname, 'wb'))
# X_total_.to_pickle(feat_mat)

plot how each feature group changes the model's AUC

In [None]:
feature_name = 'topic_feature'
modelname = f"intermediate/{feature_name}_clf.pkl.gz"
clf = pickle.load(open(modelname, 'rb'))
plot_df[feature_name] = clf.evals_result()['validation_0']['auc']
# plot_df.to_pickle('plot_df.pkl.gz')
# plot_df = pd.read_pickle('plot_df.pkl.gz')

In [None]:
plot_df.head(3)

In [None]:
labels = ['base', 'base+clicks_by_ip', 'base+conf_rate', 
          'base+group_by',  'base+next_click', 
          'base+history_click', 'base+topic']
plt.boxplot(plot_df.T, showfliers=False, labels=labels, vert=True)
plt.grid(True)
plt.xticks(rotation=20)
plt.ylabel('AUC')
plt.show()

### evaluate feature importance within feature groups

In [None]:
def eval_within_groups(feature_name):
    feat_mat = f"feature/{feature_name}.pkl.gz"
    modelname = f"model/{feature_name}_clf.pkl.gz"
    clf = pickle.load(open(modelname, 'rb'))
    X_total = pd.read_pickle(feat_mat)
    X_total = X_total[X_total['day']!=10]

    # Get xgBoost importances
    importance_dict = {}
    for import_type in ['weight']:
        importance_dict['xgBoost-'+import_type] = clf.get_booster().get_fscore()

    # MinMax scale all importances
    importance_df = pd.DataFrame(importance_dict).fillna(0)
    importance_df = pd.DataFrame(
        preprocessing.MinMaxScaler().fit_transform(importance_df),
        columns=importance_df.columns,
        index=importance_df.index
    )

    sum_features = sum(importance_df['xgBoost-weight'])
    importance_df = importance_df.sort_values('xgBoost-weight',ascending=False)
    weight_list = importance_df['xgBoost-weight']
    col_70, col_80, col_90 = [], [], []

    cur_sum = 0
    for i, col in enumerate(importance_df.index):
        cur_sum += weight_list.iloc[i]  # positional access; the Series is indexed by feature name
        if 1.0*cur_sum/sum_features < 0.7:
            col_70.append(col)
        if 1.0*cur_sum/sum_features < 0.8:
            col_80.append(col)
        if 1.0*cur_sum/sum_features < 0.9:   
            col_90.append(col)

    if 'day' not in col_70:
        col_70.append('day')
    if 'day' not in col_80:
        col_80.append('day')
    if 'day' not in col_90:
        col_90.append('day')    
    if 'is_attributed' not in col_70:
        col_70.append('is_attributed')
    if 'is_attributed' not in col_80:
        col_80.append('is_attributed')
    if 'is_attributed' not in col_90:
        col_90.append('is_attributed')     

    print(len(col_70), len(col_80), len(col_90), len(importance_df.index))

    plot_df = pd.DataFrame()
    
    clf = pickle.load(open(f'model/{feature_name}_70_clf.pkl.gz', 'rb'))
    plot_df['70'] = clf.evals_result()['validation_0']['auc']
    gc.collect()
    print("70 done")
    clf, evals_result = xgb_train(X_total[col_80])
    pickle.dump(clf, open(f'model/{feature_name}_80_clf.pkl.gz', 'wb'))
    plot_df['80'] = evals_result['validation_0']['auc']
    gc.collect()
    print("80 done")
    clf, evals_result = xgb_train(X_total[col_90])
    pickle.dump(clf, open(f'model/{feature_name}_90_clf.pkl.gz', 'wb'))
    plot_df['90'] = evals_result['validation_0']['auc']
    gc.collect()
    print("90 done")
    clf = pickle.load(open(modelname, 'rb'))
    plot_df['100'] = clf.evals_result()['validation_0']['auc']
    gc.collect()
    
    labels = ['70%', '80%', '90%', '100%']
    plt.boxplot(plot_df.T, showfliers=False, labels=labels, vert=True)
    plt.grid(True)
    plt.xticks(rotation=20)
    plt.ylabel('AUC')
    plt.title(f'evaluation within {feature_name} groups')
    plt.savefig(f'evaluation_within_{feature_name}_groups.png')
    plt.show()

confidence_rate_feature

In [None]:
eval_within_groups('confidence_rate_feature')

group_by_feature

In [None]:
eval_within_groups('group_by_feature')

next_click_feature

In [None]:
eval_within_groups('next_click_feature')

history_click_feature

In [None]:
eval_within_groups('history_click_feature')

topic_feature

In [None]:
eval_within_groups('topic_feature')

### evaluate feature importance

The feature importances are MinMax scaled, put into a DataFrame, and finally plotted ordered by the mean feature importance.

Here, importance means XGBoost's `weight` score: the number of times a feature is used to split the data across all trees.
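
As a small illustration of the scaling step, the sketch below MinMax-scales a dictionary of split counts and ranks the features. The counts are made-up values in the shape returned by `Booster.get_fscore()`, and plain pandas arithmetic stands in for `MinMaxScaler`.

```python
import pandas as pd

# Hypothetical split counts, in the shape returned by Booster.get_fscore().
fscore = {"ip": 120, "app": 300, "channel": 60, "hour": 30}

importance = pd.Series(fscore, dtype="float64")
# MinMax scaling maps the smallest value to 0 and the largest to 1.
scaled = (importance - importance.min()) / (importance.max() - importance.min())
ranked = scaled.sort_values(ascending=False)
print(ranked)
```

One consequence of MinMax scaling is that the least-used feature always ends up with a scaled importance of exactly 0, so it adds nothing to any cumulative share computed afterwards.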

In [None]:
def eval_important_features(feature_name, figsize):
    modelname = f"intermediate/{feature_name}_clf.pkl.gz"
    clf = pickle.load(open(modelname, 'rb'))

    fig, ax = plt.subplots(1,1,figsize=figsize)
    plot_importance(clf, ax=ax)
    plt.show()

    plot_feature_importance(clf)

base

In [None]:
eval_important_features('base',[8, 5])

clicks_by_ip

In [None]:
eval_important_features('clicks_by_ip',[8,5])

confidence_rate_feature

In [None]:
eval_important_features('confidence_rate_feature',[8,5])

group_by_feature

In [None]:
eval_important_features('group_by_feature',[8,5])

next_click_feature

In [None]:
eval_important_features('next_click_feature',[8,5])

history_click_feature

In [None]:
eval_important_features('history_click_feature',[8,5])

topic_feature

In [None]:
eval_important_features('topic_feature',[13,50])

<a id='phase2'></a>
## Phase 2 : Training Model

1.Preprocessing full data

In [None]:
start_time = time.time()
# Load the full train and test sets (only the listed columns, with compact dtypes)
train = pd.read_csv(path+"train.csv", usecols=train_columns, dtype=dtypes, parse_dates=['click_time'])
test = pd.read_csv(path+"test.csv", usecols=test_columns, dtype=dtypes, parse_dates=['click_time'])
print('[{}] Finished loading data'.format(time.time() - start_time))

In [None]:
train = time_features(train)
gc.collect()
test = time_features(test)
gc.collect()

test['is_attributed'] = np.nan
X_total = pd.concat([train,test.drop(['click_id'], axis=1)])
# combine clicks by ip into X_total, so the feature cells below build on it
X_total = clicks_by_ip(X_total)
del train, test
gc.collect()

In [None]:
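feature_name = "confidence_rate_feature"
# Same call as in Phase 1: rates are estimated on the labeled days only (day 10 is the test set)
X_train = X_total[X_total['day'] != 10]
X_total_ = confidence_rate_feature(X_total, X_train)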
feat_mat = f"feature/{feature_name}.pkl.gz"
X_total_.to_pickle(feat_mat)
gc.collect()

In [None]:
feature_name = "group_by_feature"
X_total_ = group_by_feature(X_total)
feat_mat = f"feature/{feature_name}.pkl.gz"
X_total_.to_pickle(feat_mat)
gc.collect()

In [None]:
feature_name = "next_click_feature"
X_total_ = next_click_feature(X_total)
feat_mat = f"feature/{feature_name}.pkl.gz"
X_total_.to_pickle(feat_mat)
gc.collect()

In [None]:
feature_name = "history_click_feature"
X_total_ = history_click_feature(X_total)
feat_mat = f"feature/{feature_name}.pkl.gz"
X_total_.to_pickle(feat_mat)
gc.collect()

In [None]:
feature_name = "topic_feature"
X_total_ = topic_feature(X_total)
feat_mat = f"feature/{feature_name}.pkl.gz"
X_total_.to_pickle(feat_mat)
gc.collect()

2.Generating features selected from the previous phase

In [None]:
# features
basic = ['ip','hour','minute']

conf_rate_v1 = ['app_confRate','channel_confRate','ip_confRate']
conf_rate_v2 = ['app_channel_confRate','app_os_confRate','app_device_confRate']
conf_rate_v3 = ['channel_device_confRate','channel_os_confRate','os_device_confRate']

group_by_v1 = ['ip_day_hour_count_channel']
group_by_v2 = ['app_count_channel','channel_count_app']
group_by_v3 = ['ip_cumcount_app','ip_nunique_device','ip_device_os_cumcount_app','ip_nunique_app','ip_nunique_channel']

next_click_v1 = ['ip_nextClick','ip_app_nextClick','ip_os_nextClick','ip_channel_nextClick']
next_click_v3 = ['ip_os_device_app_nextClick','ip_app_device_os_channel_nextClick']

history_click = ['future_app_clicks','future_identical_clicks']

# choose
chosen_features = basic+conf_rate_v1+conf_rate_v2+conf_rate_v3 \
                    +group_by_v1+group_by_v2+group_by_v3 \
                    +next_click_v1+next_click_v3 \
                    +history_click

### experiments on feature importance

In [None]:
def eval_within_groups(feature_name):
    feat_mat = f"feature/{feature_name}.pkl.gz"
    modelname = f"model/{feature_name}_clf.pkl.gz"
    clf = pickle.load(open(modelname, 'rb'))
    X_total = pd.read_pickle(feat_mat)
    X_total = X_total[X_total['day']!=10]

    # Get xgBoost importances
    importance_dict = {}
    for import_type in ['weight']:
        importance_dict['xgBoost-'+import_type] = clf.get_booster().get_fscore()

    # MinMax scale all importances
    importance_df = pd.DataFrame(importance_dict).fillna(0)
    importance_df = pd.DataFrame(
        preprocessing.MinMaxScaler().fit_transform(importance_df),
        columns=importance_df.columns,
        index=importance_df.index
    )

    sum_features = sum(importance_df['xgBoost-weight'])
    importance_df = importance_df.sort_values('xgBoost-weight',ascending=False)
    weight_list = importance_df['xgBoost-weight']
    col_70, col_80, col_90 = [], [], []

    cur_sum = 0
    for i, col in enumerate(importance_df.index):
        cur_sum += weight_list.iloc[i]  # positional access; the Series is indexed by feature name
        if 1.0*cur_sum/sum_features < 0.7:
            col_70.append(col)
        if 1.0*cur_sum/sum_features < 0.8:
            col_80.append(col)
        if 1.0*cur_sum/sum_features < 0.9:   
            col_90.append(col)

    if 'day' not in col_70:
        col_70.append('day')
    if 'day' not in col_80:
        col_80.append('day')
    if 'day' not in col_90:
        col_90.append('day')    
    if 'is_attributed' not in col_70:
        col_70.append('is_attributed')
    if 'is_attributed' not in col_80:
        col_80.append('is_attributed')
    if 'is_attributed' not in col_90:
        col_90.append('is_attributed')     

    print(len(col_70), len(col_80), len(col_90), len(importance_df.index))

    plot_df = pd.DataFrame()
    clf, evals_result = xgb_train(X_total[col_70])
    pickle.dump(clf, open(f'model/{feature_name}_70_clf.pkl.gz', 'wb'))
    plot_df['70'] = evals_result['validation_0']['auc']
    gc.collect()
    print("70 done")
    clf, evals_result = xgb_train(X_total[col_80])
    pickle.dump(clf, open(f'model/{feature_name}_80_clf.pkl.gz', 'wb'))
    plot_df['80'] = evals_result['validation_0']['auc']
    gc.collect()
    print("80 done")
    clf, evals_result = xgb_train(X_total[col_90])
    pickle.dump(clf, open(f'model/{feature_name}_90_clf.pkl.gz', 'wb'))
    plot_df['90'] = evals_result['validation_0']['auc']
    gc.collect()
    print("90 done")
    clf = pickle.load(open(modelname, 'rb'))
    plot_df['100'] = clf.evals_result()['validation_0']['auc']
    gc.collect()
    
    labels = ['70%', '80%', '90%', '100%']
    plt.boxplot(plot_df.T, showfliers=False, labels=labels, vert=True)
    plt.grid(True)
    plt.xticks(rotation=20)
    plt.ylabel('AUC')
    plt.title(f'evaluation within {feature_name} groups')
    plt.show()


In [None]:
eval_within_groups('confidence_rate_feature')

In [None]:
eval_within_groups('group_by_feature')

In [None]:
eval_within_groups('next_click_feature')

In [None]:
eval_within_groups('history_click_feature')

In [None]:
eval_within_groups('topic_feature')

### Feature Selection

In [None]:
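# importance_df, weight_list and sum_features are local to eval_within_groups,
# so they must be recomputed (or returned by that function) before this cell runs.
col_90 = []
cur_sum = 0
for i, col in enumerate(importance_df.index):
    cur_sum += weight_list.iloc[i]
    if 1.0*cur_sum/sum_features < 0.9:   
        col_90.append(col)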
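
The selection rule in the cell above can also be written as a standalone function, which avoids depending on variables from another scope. This is a sketch only: the function name and the toy importances are hypothetical.

```python
def select_by_cumulative_importance(importance, threshold):
    """Return features, most important first, for as long as the running
    share of total importance stays below `threshold`."""
    total = sum(importance.values())
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
    selected, running = [], 0.0
    for name, weight in ranked:
        running += weight
        if running / total < threshold:
            selected.append(name)
    return selected

# Toy importances (hypothetical values).
importance = {"a": 50, "b": 30, "c": 15, "d": 5}
col_70 = select_by_cumulative_importance(importance, 0.7)
col_90 = select_by_cumulative_importance(importance, 0.9)
print(col_70, col_90)
```

Note that the feature which pushes the running share past the threshold is itself excluded, so the selected set always covers less than the stated share of total importance.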

In [None]:
chosen_features = col_90
feature_name = "confidence_rate_feature"
feat_mat = f"feature/{feature_name}.pkl.gz"
X_total_ = pd.read_pickle(feat_mat)
X_chosen = X_total_[chosen_features]
X_total = pd.concat([X_total,X_chosen],axis=1) 
gc.collect()

In [None]:
chosen_features = col_90
feature_name = "group_by_feature"
feat_mat = f"feature/{feature_name}.pkl.gz"
X_total_ = pd.read_pickle(feat_mat)
X_chosen = X_total_[chosen_features]
X_total = pd.concat([X_total,X_chosen],axis=1) 
gc.collect()

In [None]:
chosen_features = col_90
feature_name = "next_click_feature"
feat_mat = f"feature/{feature_name}.pkl.gz"
X_total_ = pd.read_pickle(feat_mat)
X_chosen = X_total_[chosen_features]
X_total = pd.concat([X_total,X_chosen],axis=1) 
gc.collect()

In [None]:
chosen_features = col_90
feature_name = "history_click_feature"
feat_mat = f"feature/{feature_name}.pkl.gz"
X_total_ = pd.read_pickle(feat_mat)
X_chosen = X_total_[chosen_features]
X_total = pd.concat([X_total,X_chosen],axis=1) 
gc.collect()

In [None]:
chosen_features = col_90
feature_name = "topic_feature"
feat_mat = f"feature/{feature_name}.pkl.gz"
X_total_ = pd.read_pickle(feat_mat)
X_chosen = X_total_[chosen_features]
X_total = pd.concat([X_total,X_chosen],axis=1) 
gc.collect()

3.Separate train/dev/test set

In [None]:
X_total = pd.read_pickle('X_total_fea.pkl.gz')

In [None]:
X_total.drop(['click_time'], axis=1, inplace=True)
X_total = X_total.fillna(0)
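# Day 10 holds the unlabeled test rows (their NaN labels were just filled with 0), so keep it out of training
X_train = X_total[X_total['day'] < 9]
# Day 9 is the held-out dev set
X_test = X_total[X_total['day']==9]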

4.Negative down-sampling (positive : negative = 1 : 1)

Draw 3 different down-sampled datasets.
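
The down-sampling step can be sketched as a reusable helper that keeps every positive row and draws `ratio` negatives per positive. The function name, the `ratio` and `seed` parameters and the toy data are assumptions made for illustration.

```python
import pandas as pd

def downsample_negatives(df, ratio=1, seed=0):
    """Keep all positives and sample `ratio` negatives per positive."""
    pos = df[df["is_attributed"] == 1]
    neg = df[df["is_attributed"] == 0]
    n_neg = min(len(neg), ratio * len(pos))
    sampled_neg = neg.sample(n=n_neg, random_state=seed)
    return pd.concat([sampled_neg, pos]).reset_index(drop=True)

# Toy data (hypothetical): 5 positives and 95 negatives.
toy = pd.DataFrame({"is_attributed": [1] * 5 + [0] * 95})
# Three 1:1 samples, each drawn with a different seed.
samples = {i: downsample_negatives(toy, ratio=1, seed=i) for i in range(3)}
print({i: len(s) for i, s in samples.items()})
```

Using a different seed per sample gives each model a different subset of negatives, while the positives are shared by all three.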

In [None]:
pos_train = X_train[X_train['is_attributed']==1]
neg_train = X_train[X_train['is_attributed']==0]
n_pos_train = len(pos_train)

sampled_data = {}

for i in range(3):
    sampled_data[i] = pd.concat([neg_train.sample(n=n_pos_train), pos_train]).reset_index(drop=True)
del pos_train, neg_train
gc.collect()

In [None]:
len(sampled_data[0].columns)

### experiment on the effect of down-sampling

In [None]:
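plot_df = pd.DataFrame()
pos_train = X_train[X_train['is_attributed']==1]
neg_train = X_train[X_train['is_attributed']==0]

# sampled_data only holds three 1:1 samples, so draw a fresh sample for each neg:pos ratio
for i in range(6):
    n_neg = min((i+1) * len(pos_train), len(neg_train))
    data = pd.concat([neg_train.sample(n=n_neg), pos_train]).reset_index(drop=True)
    clf, evals_result = xgb_train(data)
    plot_df[f'neg:pos={i+1}:1'] = evals_result['validation_0']['auc']
    del data
    gc.collect()
    modelname = f"model/neg_pos_{i+1}_1_clf.pkl.gz"
    pickle.dump(clf, open(modelname, 'wb'))
    print(f"neg:pos={i+1}:1 done")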

In [None]:
# Use the column names as tick labels so that each box is labeled with its own ratio
labels = [f'neg:pos={k}:1' for k in range(1, 7)]
plt.boxplot(plot_df[labels].T, showfliers=False, labels=labels, vert=True)
plt.grid(True)
plt.xticks(rotation=0)
plt.ylabel('AUC')
plt.title(f'different proportions of negative down-sampling')
plt.show()

6.Training XGBoost models on 3 different sampled datasets

In [None]:
import pickle
for i in range(3):
    X = sampled_data[i].drop(['is_attributed'], axis=1)
    y = sampled_data[i]['is_attributed']
    model = grid_search_cv(X, y)
    filename = f'model/model_{i}.sav'
    pickle.dump(model, open(filename, 'wb'))
gc.collect()

<a id='phase3'></a>
## Phase 3 : Predict & Evaluate & Plot

1.Predict test set using well-trained model

2.Generate submission file

In [None]:
# sample = pd.read_csv('../input/sample_submission.csv')
eval_result = pd.DataFrame()
eval_result['label'] = X_test['is_attributed']
for i in range(3):
    filename = f'model/model_{i}.sav'
    clf = pickle.load(open(filename, 'rb'))
    eval_result[f'pred_prob_{i}'] = clf.predict_proba(X_test.drop(['is_attributed'], axis=1).fillna(0))[:,1]
#     sample.is_attributed = test_probs
#     sample.to_csv(f"submission/xgboost-chenlu-{i}.csv", index=False)

evaluate the result and plot the PR curve
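
For readers unfamiliar with the metric, the toy example below shows what `precision_recall_curve` and `average_precision_score` return for four made-up predictions. The labels and scores are illustrative values, not results from this notebook.

```python
from sklearn.metrics import average_precision_score, precision_recall_curve

# Toy labels and predicted probabilities (hypothetical values).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# One (precision, recall) pair per threshold, plus a final (1, 0) point.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
# Average precision summarizes the curve as a recall-weighted mean of precisions.
ap = average_precision_score(y_true, y_score)
print(f"average precision: {ap:.4f}")
```

Average precision is a reasonable choice here because the data is heavily imbalanced, and the PR curve focuses on the positive class.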

In [None]:
ensemble = pd.read_csv('submission_esb_1x_3n.csv')
ensemble_3nn = pd.read_csv('submission_esb_3n.csv')
ensemble_1x_3nn = pd.read_csv('submission_esb_1x_3n_newauc.csv')
ensemble_1x_1nn = pd.read_csv('submission_esb_1x_1n.csv')
szh_label = pd.read_csv('y_true.csv')
szh_pred1 = pd.read_csv('submission/y_pred1.csv')
szh_pred2 = pd.read_csv('submission/y_pred2.csv')
szh_pred3 = pd.read_csv('submission/y_pred3.csv')

In [None]:
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score

fig, axs = plt.subplots(nrows=1, ncols=1, figsize=(8, 6))

def add_line(label, predict, line_name):
    y_label = label
    y_prob = predict
    
    precision, recall, _ = precision_recall_curve(y_label, y_prob)
    average_precision = average_precision_score(y_label, y_prob)
    
    plt.step(recall, precision, alpha=0.8, label=f'{line_name} pr_auc:{average_precision:0.6f}', where='post')
    

plt.sca(axs)
add_line(eval_result['label'], eval_result['pred_prob_0'], "xgb_1")
add_line(eval_result['label'], szh_pred1['0'], "DNN_1")
add_line(eval_result['label'], szh_pred2['0'], "DNN_2")
add_line(eval_result['label'], szh_pred3['0'], "DNN_3")
add_line(eval_result['label'], ensemble_1x_3nn['is_attributed'],'ensemble_1x_3nn')
add_line(eval_result['label'], ensemble_3nn['is_attributed'],'ensemble_3nn')
add_line(eval_result['label'], ensemble_1x_1nn['is_attributed'],'ensemble_1x_1nn')


# add_line(eval_result['label'], eval_result['pred_prob_3'], "model_3")

plt.legend(loc=0, prop={'size': 12})

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])

plt.title(f'PR Curve in test set')