## Misclassification cost as part of training

Scikit-learn offers two ways to introduce cost into the learning function of an algorithm:

- Setting the **class_weight** parameter when we initialize the estimator, for those estimators that support it.
- Passing a **sample_weight** vector, with one weight per observation, when we fit the estimator.


With either the **class_weight** parameter or the **sample_weight** vector, we indicate that the loss function should be modified to accommodate the class imbalance and the cost attributed to each misclassification.
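As a minimal sketch of where each option goes, the snippet below fits the same model twice on synthetic data. The dataset, the cost of 10 and the variable names are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data (assumption): roughly 5% of class 1.
X, y = make_classification(
    n_samples=1000, n_features=5, weights=[0.95], random_state=0)

# Option 1: cost per class, declared when the estimator is created.
clf_cw = LogisticRegression(class_weight={0: 1, 1: 10})
clf_cw.fit(X, y)

# Option 2: cost per observation, passed when the estimator is fitted.
weights = np.where(y == 1, 10, 1)
clf_sw = LogisticRegression()
clf_sw.fit(X, y, sample_weight=weights)

# Both runs encode the same cost, so the coefficients should agree.
coefs_agree = np.allclose(clf_cw.coef_, clf_sw.coef_, atol=1e-6)
print(coefs_agree)
```

The only difference between the two calls is where the cost is declared: in the constructor for class_weight, in `fit` for sample_weight.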

## Parameters

**class_weight**: can take 'balanced' as argument, in which case each class is weighted inversely to its frequency, that is, n_samples / (n_classes * n_samples_in_class). Alternatively, it can take a dictionary with {class: penalty} pairs. In this case, mistakes in samples of class i are penalized with class_weight[i].

So if class_weight = {0: 1, 1: 10}, misclassifications of observations of class 1 are penalized 10 times more than misclassifications of observations of class 0.
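To see what 'balanced' resolves to, the sketch below compares scikit-learn's `compute_class_weight` helper with the same formula written by hand. The toy label vector is an assumption for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (assumption): 90 observations of class 0, 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
classes = np.unique(y)

# Weights that class_weight='balanced' would assign to each class.
balanced = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# Same thing by hand: n_samples / (n_classes * count of each class).
manual = len(y) / (len(classes) * np.bincount(y))

print(dict(zip(classes.tolist(), balanced.tolist())))
```

The minority class ends up with a weight 9 times larger than the majority class, which matches the 90:10 imbalance of the toy labels.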

**sample_weight** is a vector of the same length as y, containing the weight or penalty for each individual observation. In principle, it is more flexible, because it allows us to set weights for individual observations rather than for the class as a whole. For example, we could set higher penalties for fraudulent applications that are more costly (money-wise) than for fraudulent applications that involve little money.
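A per-observation cost like the fraud example could be built as follows. The `amount` values and the rule "1 + amount / 100" are hypothetical, chosen only to show the mechanics.

```python
import numpy as np

# Toy target (assumption): 1 = fraudulent application, 0 = legitimate.
y = np.array([0, 0, 1, 0, 1, 0])

# Hypothetical money involved in each application.
amount = np.array([120.0, 40.0, 5000.0, 75.0, 200.0, 60.0])

# Legitimate applications keep a weight of 1; fraudulent ones get a
# weight that grows with the money at stake (illustrative rule only).
sample_weight = np.where(y == 1, 1 + amount / 100, 1.0)

print(sample_weight)
```

The resulting vector would then be passed as `sample_weight` to the `fit` method of the estimator.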

## Important

If you use both class_weight and sample_weight, the final penalty will be **the combination of the 2**: the weight of each observation is multiplied by the weight of its class. So be very careful.
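The sketch below illustrates how the two combine in Logistic Regression. The synthetic data and the random per-observation costs are assumptions; the point is that using both parameters should match a single sample_weight vector holding their product.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data (assumption).
X, y = make_classification(
    n_samples=1000, n_features=5, weights=[0.9], random_state=0)

# Arbitrary per-observation costs between 1 and 3 (assumption).
rng = np.random.RandomState(0)
sample_weight = rng.uniform(1, 3, size=len(y))

# Model A: class_weight and sample_weight used together.
both = LogisticRegression(class_weight={0: 1, 1: 10})
both.fit(X, y, sample_weight=sample_weight)

# Model B: only sample_weight, pre-multiplied by the class cost.
product = sample_weight * np.where(y == 1, 10, 1)
only_sw = LogisticRegression()
only_sw.fit(X, y, sample_weight=product)

same_model = np.allclose(both.coef_, only_sw.coef_, atol=1e-6)
print(same_model)
```

If the sample_weight vector already encodes the class cost, adding class_weight on top applies that cost twice.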

## Demo

In this demo, I will introduce cost-sensitive learning to Logistic Regression. But keep in mind that you can do the same with almost every other classifier in Scikit-learn using **sample_weight** or, in those estimators that have that parameter, using **class_weight**.

## Classifiers that support class_weight

In [1]:
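# Let's find out which classifiers from sklearn support class_weight
# as part of the __init__ method, that is, when we set them up

# in recent versions all_estimators is imported from sklearn.utils
# (the old sklearn.utils.testing module was removed)
from sklearn.utils import all_estimators

estimators = all_estimators(type_filter='classifier')

for name, class_ in estimators:
    try:
        # some classifiers cannot be instantiated without
        # arguments; we skip those
        if hasattr(class_(), 'class_weight'):
            print(name)
    except Exception:
        pass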



DecisionTreeClassifier
ExtraTreeClassifier
ExtraTreesClassifier
LinearSVC
LogisticRegression
LogisticRegressionCV
NuSVC
PassiveAggressiveClassifier
Perceptron
RandomForestClassifier
RidgeClassifier
RidgeClassifierCV
SGDClassifier
SVC


Not all classifiers support class_weight. For those that don't, like GradientBoostingClassifier, we can still use sample_weight when we fit the estimator.
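For an estimator without class_weight, one possible approach is to turn class costs into a per-observation vector with `compute_sample_weight` and pass it to `fit`. This is a sketch on synthetic data; the dataset and hyperparameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic imbalanced data (assumption): roughly 5% of class 1.
X, y = make_classification(
    n_samples=1000, n_features=5, weights=[0.95], random_state=0)

# One weight per row, inversely proportional to the class frequency.
weights = compute_sample_weight(class_weight="balanced", y=y)

gbm = GradientBoostingClassifier(n_estimators=50, random_state=0)

# GradientBoostingClassifier has no class_weight parameter ...
has_class_weight = "class_weight" in gbm.get_params()

# ... so the cost is passed at fit time instead.
gbm.fit(X, y, sample_weight=weights)
print(has_class_weight)
```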

## Demo

In this demo, we are going to introduce the misclassification cost in Logistic Regression, using class_weight and then sample_weight.

In [2]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

In [3]:
# load data
# only a few observations to speed up the computation

data = pd.read_csv('../kdd2004.csv').sample(10000)

data.head()

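| Unnamed: 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 95237 | 53.1 | 22.99 | -0.88 | -27.5 | -2.5 | 851.8 | -0.12 | -1.25 | -32.5 | -60.0 | ... | 916.6 | -0.67 | -0.14 | 1.0 | -25.0 | -14.3 | 1.58 | 0.2 | -0.42 | -1 |
| 48631 | 53.81 | 22.32 | -1.2 | -22.5 | 38.5 | 2449.8 | -1.57 | 0.67 | 11.5 | -89.0 | ... | 2151.4 | 0.28 | 2.31 | 8.0 | -58.0 | 461.0 | 0.44 | 0.36 | 0.49 | -1 |
| 122697 | 59.18 | 22.38 | 0.34 | -54.5 | 89.5 | 4512.1 | -1.15 | -0.7 | -24.0 | -72.5 | ... | 5428.1 | 0.15 | 0.3 | -17.0 | -124.0 | 276.8 | 2.04 | 0.23 | -0.16 | -1 |
| 69124 | 86.4 | 27.31 | -0.01 | -14.0 | -42.0 | 2344.7 | -1.03 | -1.12 | -56.5 | -88.0 | ... | 1226.4 | 0.96 | -0.26 | 13.0 | -75.0 | 784.0 | -0.34 | 0.04 | 0.5 | -1 |
| 61900 | 43.55 | 26.79 | -0.83 | -45.0 | 38.0 | 1114.4 | 0.03 | -0.89 | 4.5 | -45.5 | ... | 1377.9 | -0.45 | -0.38 | -3.0 | -35.0 | 684.8 | -0.95 | 0.24 | 0.82 | -1 |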


In [4]:
# imbalanced target

data.target.value_counts() / len(data)

-1    0.9896
 1    0.0104
Name: target, dtype: float64

In [5]:
# separate dataset into train and test

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1),  # drop the target
    data['target'],  # just the target
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((7000, 74), (3000, 74))

## Using class_weight

In [6]:
# Logistic Regression with class_weight

# we initialize the cost / weights when we set up the estimator

def run_Logit(X_train, X_test, y_train, y_test, class_weight):

    # weights introduced here
    logit = LogisticRegression(
        penalty='l2',
        solver='newton-cg',
        random_state=0,
        max_iter=10,
        n_jobs=4,
        class_weight=class_weight # weights / cost
    )

    logit.fit(X_train, y_train)

    # probability of the positive class is in the second column
    print('Train set')
    pred = logit.predict_proba(X_train)
    print(
        'Logistic Regression roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))

    print('Test set')
    pred = logit.predict_proba(X_test)
    print(
        'Logistic Regression roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))

In [7]:
# evaluate performance of algorithm built
# using imbalanced dataset, without weights

run_Logit(X_train,
          X_test,
          y_train,
          y_test,
          class_weight=None)

Train set
Logistic Regression roc-auc: 0.9400907029478458
Test set
Logistic Regression roc-auc: 0.9471163381063821


In [8]:
# evaluate performance of algorithm built
# cost estimated as imbalance ratio

# 'balanced' weights each class inversely to its
# frequency, so that both classes contribute equally
# to the loss, thus, imbalance ratio

run_Logit(X_train,
          X_test,
          y_train,
          y_test,
          class_weight='balanced')

Train set
Logistic Regression roc-auc: 0.9775468975468975
Test set
Logistic Regression roc-auc: 0.9627444369521241


In [9]:
# evaluate performance of algorithm built
# cost set manually

# alternatively, we can pass a different cost
# in a dictionary, if we know it already

run_Logit(X_train,
          X_test,
          y_train,
          y_test,
          class_weight={-1:1, 1:10})

Train set
Logistic Regression roc-auc: 0.9539476396619254
Test set
Logistic Regression roc-auc: 0.9500317321803974


Play with the cost and see what you get in terms of performance.
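One way to explore different costs is a simple loop over candidate values. The sketch below uses synthetic data rather than the dataset of this notebook, and the list of costs is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption): roughly 5% of class 1.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

results = {}
for cost in [1, 5, 10, 50, 100]:
    # the minority class (1) is penalized `cost` times more than class 0
    logit = LogisticRegression(class_weight={0: 1, 1: cost}, max_iter=1000)
    logit.fit(X_train, y_train)
    proba = logit.predict_proba(X_test)[:, 1]
    results[cost] = roc_auc_score(y_test, proba)

for cost, auc in results.items():
    print('cost={:>3}  test roc-auc={:.3f}'.format(cost, auc))
```

In practice the cost would be chosen with cross-validation on the training set rather than by looking at the test set, which is used here only to keep the example short.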

## Using sample_weight

In [10]:
# Logistic Regression + sample_weight

# we pass the weights / cost, when we train the algorithm

def run_Logit(X_train, X_test, y_train, y_test, sample_weight):

    logit = LogisticRegression(
        penalty='l2',
        solver='newton-cg',
        random_state=0,
        max_iter=10,
        n_jobs=4,
    )

    # costs are passed here
    logit.fit(X_train, y_train, sample_weight=sample_weight)

    # probability of the positive class is in the second column
    print('Train set')
    pred = logit.predict_proba(X_train)
    print(
        'Logistic Regression roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))

    print('Test set')
    pred = logit.predict_proba(X_test)
    print(
        'Logistic Regression roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))

In [11]:
# evaluate performance of algorithm built
# using imbalanced dataset, without weights

run_Logit(X_train,
          X_test,
          y_train,
          y_test,
          sample_weight=None)

Train set
Logistic Regression roc-auc: 0.9400907029478458
Test set
Logistic Regression roc-auc: 0.9471163381063821


In [12]:
# evaluate performance of algorithm built
# cost estimated as imbalance ratio

# with numpy.where, we assign a cost of 99 to
# each observation of the minority class, and 1
# otherwise.

run_Logit(X_train,
          X_test,
          y_train,
          y_test,
          sample_weight=np.where(y_train==1,99,1))

Train set
Logistic Regression roc-auc: 0.9775468975468975
Test set
Logistic Regression roc-auc: 0.9627444369521241
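Rather than hard-coding a cost such as 99, the imbalance ratio could be derived from the training labels. The sketch below uses a hypothetical label vector with 990 majority and 10 minority observations; with real data, `y_train` from the split would be used instead.

```python
import numpy as np

# Hypothetical training labels: 990 of class -1 and 10 of class 1.
y_train = np.array([-1] * 990 + [1] * 10)

n_majority = (y_train == -1).sum()
n_minority = (y_train == 1).sum()

# how many majority observations there are per minority observation
imbalance_ratio = n_majority / n_minority

# minority observations get the ratio as cost, the rest get 1
sample_weight = np.where(y_train == 1, imbalance_ratio, 1.0)

# after weighting, both classes carry the same total weight
total_minority = sample_weight[y_train == 1].sum()
total_majority = sample_weight[y_train == -1].sum()
print(imbalance_ratio, total_minority, total_majority)
```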


## Conclusion

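Cost-sensitive learning improved the performance of the model: the test ROC-AUC rose from 0.947 without weights to 0.963 with the cost set to the imbalance ratio, and to 0.950 with a cost of 10 for the minority class.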

**HOMEWORK**

Try other machine learning algorithms and other datasets available in imbalanced-learn.