## Estimating the Cost with Cross-Validation

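We mentioned that there are three ways of estimating the cost:

- A domain expert provides the cost
- Balance ratio (we did this in the previous notebook)
- Cross-validation: treat the cost as a hyperparameter

In this notebook, we will find the cost with a hyperparameter search and cross-validation. The cost enters the model through the `class_weight` parameter, so we search over candidate class weights together with the other hyperparameters.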

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

In [2]:
# load data
# only a few observations to speed up the computation

data = pd.read_csv('../kdd2004.csv').sample(10000)

data.head()

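| Unnamed: 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 99077 | 97.2 | 24.21 | 2.43 | 39.5 | -73.0 | 3359.4 | 1.02 | 1.25 | 28.5 | -171.0 | ... | 4101.3 | -1.0 | 0.61 | -9.0 | -197.0 | 970.5 | 0.63 | 0.07 | -0.35 | -1 |
| 127475 | 32.17 | 29.55 | 0.23 | -62.5 | 38.5 | 969.5 | 0.27 | -1.55 | -6.5 | -31.5 | ... | 493.3 | 2.11 | 0.8 | 1.0 | -23.0 | -5.4 | 1.04 | 0.43 | -0.03 | -1 |
| 44961 | 66.98 | 26.28 | -0.66 | -11.0 | 1.0 | 1499.6 | 0.4 | 0.08 | 5.0 | -86.0 | ... | 1727.8 | -0.67 | 0.13 | -4.0 | -87.0 | 459.1 | 2.11 | 0.44 | 0.77 | -1 |
| 68588 | 76.0 | 22.92 | 0.88 | 43.0 | -9.5 | 1985.5 | 1.13 | 0.8 | 4.0 | -96.0 | ... | 2542.9 | 1.6 | 5.41 | 9.0 | -160.0 | 171.4 | 1.95 | 0.43 | 0.83 | 1 |
| 123681 | 60.69 | 25.27 | -0.06 | 27.5 | 15.5 | 1003.4 | 0.76 | -0.33 | -7.5 | -58.5 | ... | 2001.0 | -1.7 | 1.68 | 1.0 | -61.0 | 223.1 | 1.29 | 0.26 | 0.67 | -1 |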


In [3]:
# imbalanced target

data.target.value_counts() / len(data)

target
-1    0.9906
 1    0.0094
Name: count, dtype: float64

In [4]:
# separate dataset into train and test

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1),  # drop the target
    data['target'],  # just the target
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((7000, 74), (3000, 74))

In [5]:
# set up initial random forest

rf = RandomForestClassifier(n_estimators=50,
                            random_state=39,
                            max_depth=2,
                            n_jobs=4,
                            class_weight=None)

In [6]:
# set up parameter search grid
# including class weight

param_grid = {
  'n_estimators': [10, 50, 100],
  'max_depth': [None, 2, 3],
  'class_weight': [None, {-1:1, 1:10}, {-1:1, 1:100}],
}
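
The grid above keeps the majority-class cost at 1 and varies only the cost of the minority class. As a minimal sketch, the same candidates could be generated from a list of costs. The helper name `make_class_weight_grid` is hypothetical and not part of scikit-learn.

```python
def make_class_weight_grid(minority_costs, majority_label=-1, minority_label=1):
    """Return class_weight candidates: None plus one dict per minority cost.

    Hypothetical helper for illustration; the majority cost is fixed at 1.
    """
    grid = [None]
    for cost in minority_costs:
        grid.append({majority_label: 1, minority_label: cost})
    return grid


param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [None, 2, 3],
    'class_weight': make_class_weight_grid([10, 100]),
}

print(param_grid['class_weight'])
```

Adding more costs to the list widens the search, at the price of one extra fit per fold for every additional combination.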

In [7]:
# grid search with 2-fold cross-validation, scored by ROC-AUC

search = GridSearchCV(estimator=rf,
                      scoring='roc_auc',
                      param_grid=param_grid,
                      cv=2,
                      ).fit(X_train, y_train)
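
With an integer `cv` and a classifier, `GridSearchCV` uses stratified folds, so each fold keeps roughly the same share of the rare class. The sketch below makes the splitter explicit and adds shuffling. It runs on synthetic data from `make_classification` as a stand-in for the KDD file (an assumption), and the fold count and grid are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the real data (assumption): about 5% positives, labels -1/1
X_syn, y_syn = make_classification(n_samples=1000, n_features=10,
                                   weights=[0.95], random_state=0)
y_syn = np.where(y_syn == 1, 1, -1)

# explicit stratified splitter with shuffling
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# every validation fold should contain both classes
for _, val_idx in cv.split(X_syn, y_syn):
    assert set(y_syn[val_idx]) == {-1, 1}

search_cv = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=39),
    param_grid={'class_weight': [None, {-1: 1, 1: 10}]},
    scoring='roc_auc',
    cv=cv,
).fit(X_syn, y_syn)

print(search_cv.best_params_)
```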

In [8]:
# mean cross-validated ROC-AUC of the best parameter combination

search.best_score_

0.9796239727508651

In [9]:
search.best_params_

{'class_weight': {-1: 1, 1: 100}, 'max_depth': 3, 'n_estimators': 50}
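
The best parameters alone do not show how much the class weight mattered. One way to look at this, sketched below on synthetic data (an assumption, since the KDD file is not bundled here), is to tabulate `cv_results_` and compare the best mean score reached for each class weight. The column names `mean_auc` and `std_auc` are chosen for this example.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real data (assumption): about 5% positives, labels -1/1
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     weights=[0.95], random_state=0)
y_demo = np.where(y_demo == 1, 1, -1)

demo_search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=39),
    param_grid={
        'max_depth': [2, 3],
        'class_weight': [None, {-1: 1, 1: 10}, {-1: 1, 1: 100}],
    },
    scoring='roc_auc',
    cv=2,
).fit(X_demo, y_demo)

# one row per parameter combination
params = demo_search.cv_results_['params']
results = pd.DataFrame({
    'class_weight': [str(p['class_weight']) for p in params],
    'max_depth': [p['max_depth'] for p in params],
    'mean_auc': demo_search.cv_results_['mean_test_score'],
    'std_auc': demo_search.cv_results_['std_test_score'],
})

# best mean ROC-AUC reached for each class weight
summary = (results.groupby('class_weight')['mean_auc']
           .max()
           .sort_values(ascending=False))
print(summary)
```

If the scores for different class weights differ by less than the fold-to-fold standard deviation, the selected cost should be read with caution, especially with only two folds.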

In [10]:
search.best_estimator_

In [11]:
# ROC-AUC of the best model on the test set

search.score(X_test, y_test)

0.9785521885521886
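
Because the search was built with `scoring='roc_auc'`, `search.score` returns the ROC-AUC of the refitted best model. The sketch below checks this against `roc_auc_score` applied to the predicted probability of the positive class. It uses synthetic data as a stand-in for the KDD file (an assumption), and the small grid is illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the real data (assumption): about 5% positives, labels -1/1
X_syn, y_syn = make_classification(n_samples=1000, n_features=10,
                                   weights=[0.95], random_state=0)
y_syn = np.where(y_syn == 1, 1, -1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, stratify=y_syn, random_state=0)

demo_search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=39),
    param_grid={'class_weight': [None, {-1: 1, 1: 10}]},
    scoring='roc_auc',
    cv=2,
).fit(X_tr, y_tr)

# classes_ is sorted as [-1, 1], so column 1 holds the probability of label 1
proba_pos = demo_search.predict_proba(X_te)[:, 1]

auc_manual = roc_auc_score(y_te, proba_pos)
auc_search = demo_search.score(X_te, y_te)
print(auc_manual, auc_search)
```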

**HOMEWORK**

Try other machine learning algorithms and other datasets available in imbalanced-learn.
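
As a possible starting point, the sketch below repeats the search with a logistic regression inside a pipeline. It uses synthetic data rather than an imbalanced-learn dataset (an assumption made so the example runs offline), and the grid values are illustrative, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (assumption): about 3% positives, labels -1/1
X_hw, y_hw = make_classification(n_samples=2000, n_features=20,
                                 weights=[0.97], random_state=0)
y_hw = np.where(y_hw == 1, 1, -1)

# scale the features, then fit the logistic regression
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class name
param_grid_lr = {
    'logisticregression__C': [0.1, 1, 10],
    'logisticregression__class_weight': [
        None, 'balanced', {-1: 1, 1: 10}, {-1: 1, 1: 100},
    ],
}

search_lr = GridSearchCV(pipe, param_grid_lr, scoring='roc_auc', cv=3)
search_lr.fit(X_hw, y_hw)

print(search_lr.best_params_)
print(search_lr.best_score_)
```

To use a real dataset instead, replace the synthetic `X_hw` and `y_hw` with the features and target of the chosen dataset and keep the rest unchanged.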