## Handling Imbalanced Dataset with Machine Learning

In [1]:
#pip install imbalanced-learn

In [2]:
import pandas as pd 

### What is Imbalanced Data?
Imbalanced data typically refers to classification problems where the classes are not represented equally. For example, we may have a 2-class (binary) classification problem with 100 instances (rows), of which 80 are labeled Class-1 and the remaining 20 are labeled Class-2. This is an imbalanced dataset, and the ratio of Class-1 to Class-2 instances is 80:20, or more concisely 4:1.

We can have a class imbalance problem on two-class classification problems as well as multi-class classification problems. Most techniques can be used on either. The remaining discussions will assume a two-class classification problem because it is easier to think about and describe.

#### Imbalance is Common
Most classification datasets do not have exactly the same number of instances in each class, but a small difference often does not matter. There are problems where a class imbalance is not just common, it is expected. For example, datasets that characterize fraudulent transactions are imbalanced: the vast majority of the transactions will be in the “Not-Fraud” class and a very small minority will be in the “Fraud” class. Another example is customer churn datasets, where the vast majority of customers stay with the service (the “No-Churn” class) and a small minority cancel their subscription (the “Churn” class).

In [3]:
df=pd.read_csv('https://raw.githubusercontent.com/ujwal-sah/my_tutorials/master/Feature%20Engineering/creditcard.csv')
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [4]:
df.shape

(275000, 31)

In [5]:
df['Class'].value_counts(normalize=True)

0    0.99824
1    0.00176
Name: Class, dtype: float64

In [6]:
# Independent and Dependent Features
X=df.drop("Class",axis=1)
y=df.Class

In [7]:
from collections import Counter
Counter(y)

Counter({0: 274516, 1: 484})

#### Under Sampling

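This is as intuitive as it sounds. Undersampling removes observations from the majority class until its size is closer to that of the minority class. The simplest variant removes them at random. The `NearMiss` method used below instead keeps the majority-class samples that lie closest to minority-class samples. With `sampling_strategy=0.8`, the resampled data has a minority-to-majority ratio of 0.8.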
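The random variant of undersampling can be sketched in plain Python. This is only an illustration on toy data that mirrors the 80:20 example from earlier; the function name `random_undersample` and its parameters are hypothetical and are not part of imbalanced-learn.

```python
import random
from collections import Counter

def random_undersample(rows, labels, majority_label, ratio=1.0, seed=0):
    """Randomly drop majority rows until minority / majority is about `ratio`."""
    rng = random.Random(seed)
    majority_idx = [i for i, lab in enumerate(labels) if lab == majority_label]
    minority_idx = [i for i, lab in enumerate(labels) if lab != majority_label]
    # Number of majority rows to keep so that minority / majority == ratio.
    n_keep = min(len(majority_idx), round(len(minority_idx) / ratio))
    kept_majority = rng.sample(majority_idx, n_keep)
    keep = sorted(kept_majority + minority_idx)
    return [rows[i] for i in keep], [labels[i] for i in keep]

# Toy data: 80 rows of class 0 (majority) and 20 rows of class 1 (minority).
rows = [[float(i)] for i in range(100)]
labels = [0] * 80 + [1] * 20

rows_us, labels_us = random_undersample(rows, labels, majority_label=0, ratio=0.8)
print(Counter(labels_us))
```

All minority rows are kept; only majority rows are discarded, which is where the information loss of undersampling comes from.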


In [8]:
from imblearn.under_sampling import NearMiss

In [9]:
ns=NearMiss(sampling_strategy=0.8)

In [10]:
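# fit_resample is the current method name; fit_sample was removed from newer imbalanced-learn releases
X_ns, y_ns = ns.fit_resample(X, y)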
print("The number of classes before fit: \n {}".format(Counter(y)))
print("The number of classes after fit:\n {}".format(Counter(y_ns)))

The number of classes before fit: 
 Counter({0: 274516, 1: 484})
The number of classes after fit:
 Counter({0: 605, 1: 484})


In [11]:
0.8*605

484.0

#### Advantages
- Training time can be reduced because the training set is smaller.
- Memory requirements are lower, which helps with large datasets.

#### Disadvantages
- Potentially critical information in the discarded majority-class observations is lost.

#### Over Sampling

This technique balances unequal classes by enlarging the rare class rather than shrinking the common one. It is useful when there is too little data to afford discarding any of it.

Random over-sampling increases the number of minority class members in the training set by repeating existing observations. The advantage of over-sampling is that no information from the original training set is lost, as all observations from the minority and majority classes are kept. On the other hand, it is prone to overfitting, because the model sees exact copies of the same minority observations many times.
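As a rough illustration of "repeating the same data", the following plain-Python sketch duplicates randomly chosen minority rows until the requested ratio is reached. The function `random_oversample` and the toy data are hypothetical and are not the library's implementation.

```python
import random
from collections import Counter

def random_oversample(rows, labels, minority_label, ratio=1.0, seed=0):
    """Duplicate random minority rows until minority / majority is about `ratio`."""
    rng = random.Random(seed)
    minority_idx = [i for i, lab in enumerate(labels) if lab == minority_label]
    n_majority = len(labels) - len(minority_idx)
    # Target minority size, and how many duplicates are needed to reach it.
    n_target = int(n_majority * ratio)
    n_extra = max(0, n_target - len(minority_idx))
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    idx = list(range(len(labels))) + extra
    return [rows[i] for i in idx], [labels[i] for i in idx]

# Toy data: 80 rows of class 0 (majority) and 20 rows of class 1 (minority).
rows = [[float(i)] for i in range(100)]
labels = [0] * 80 + [1] * 20

rows_os, labels_os = random_oversample(rows, labels, minority_label=1, ratio=0.8)
print(Counter(labels_os))
```

Every added row is an exact copy of an existing minority row, so the set of distinct observations does not change.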

In [12]:
from imblearn.over_sampling import RandomOverSampler

In [13]:
os = RandomOverSampler(sampling_strategy=0.8)

In [14]:
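# fit_resample is the current method name; fit_sample was removed from newer imbalanced-learn releases
X_os, y_os = os.fit_resample(X, y)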
print("The number of classes before fit {}".format(Counter(y)))
print("The number of classes after fit {}".format(Counter(y_os)))

The number of classes before fit Counter({0: 274516, 1: 484})
The number of classes after fit Counter({0: 274516, 1: 219612})


In [15]:
274516*0.8

219612.80000000002

#### Advantages
- No loss of information

#### Disadvantages
- Overfitting

#### SMOTETomek

This process is a little more complicated than simply oversampling or undersampling. Instead of repeating existing rows, it generates synthetic observations for the minority class. The most common technique for this is SMOTE (Synthetic Minority Over-sampling Technique). In simple terms, SMOTE picks a minority-class data point, finds its k nearest minority-class neighbours in feature space, and creates a new point at a random position on the line segment between the point and one of those neighbours. `SMOTETomek` combines SMOTE with a cleaning step that removes Tomek links, which are pairs of samples from opposite classes that are each other's nearest neighbour. This is why both class counts in the output below are slightly lower than a plain 0.8 oversampling ratio would give.

Note that SMOTE generates new data, while random oversampling just repeats already available data.
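The way SMOTE creates a new point from a minority sample and one of its nearest minority neighbours can be sketched as follows. This is a simplified illustration, not the imbalanced-learn implementation; the function `smote_sketch` and the toy points are hypothetical, and the Tomek link cleaning step is not shown.

```python
import math
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of `base`, excluding `base` itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # random position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy minority class with two features.
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5], [3.0, 1.5]]
new_points = smote_sketch(minority, n_new=4)
for point in new_points:
    print(point)
```

Each synthetic point lies between two real minority points, so it is new data rather than a copy.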

In [16]:
from imblearn.combine import SMOTETomek

In [17]:
st = SMOTETomek(sampling_strategy=0.8)

In [18]:
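# fit_resample is the current method name; fit_sample was removed from newer imbalanced-learn releases
X_st, y_st = st.fit_resample(X, y)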
print("The number of classes before fit {}".format(Counter(y)))
print("The number of classes after fit {}".format(Counter(y_st)))

The number of classes before fit Counter({0: 274516, 1: 484})
The number of classes after fit Counter({0: 273950, 1: 219046})


#### Ensemble Techniques

Ensemble methods are another way to deal with imbalanced datasets. An ensemble combines the outputs of several base classifiers in order to perform better than any single classifier, which tends to improve generalisation. There are various approaches to ensemble learning, such as bagging and boosting.
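Combining the outputs of several base learners can be as simple as a majority vote. The sketch below assumes three hypothetical classifiers that have already produced label predictions for the same five samples; `majority_vote` is an illustrative helper, not a library function.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per sample by majority vote."""
    combined = []
    for votes in zip(*predictions_per_model):
        # most_common(1) returns the label that received the most votes.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Hypothetical predictions of three base classifiers for five samples.
preds_a = [0, 0, 1, 1, 0]
preds_b = [0, 1, 1, 0, 0]
preds_c = [1, 0, 1, 1, 0]

ensemble = majority_vote([preds_a, preds_b, preds_c])
print(ensemble)
```

With an odd number of binary classifiers there are no ties. In a bagging-style setup for imbalanced data, each base learner could for instance be trained on a different balanced subsample.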

#### Advantages
- The combined model is usually more stable than a single classifier.
- Predictions are often better than those of any individual base learner.

#### Cross-Validation and Hyperparameter Tuning

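Cross-validation and hyperparameter tuning can be applied to a machine learning model directly or in combination with any of the resampling methods above. They often improve the performance of the model, although an improvement is not guaranteed. When resampling is combined with cross-validation, resample only the training folds so that each validation fold keeps the original class distribution.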
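A minimal sketch of how folds could be built so that each one preserves the class ratio. The helper `stratified_folds` and the toy labels are hypothetical; any resampling step would go inside the loop and touch the training indices only.

```python
import random
from collections import Counter

def stratified_folds(labels, n_folds=5, seed=0):
    """Assign each sample index to a fold so every fold keeps the class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for label in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == label]
        rng.shuffle(idx)
        # Deal the indices of this class round-robin across the folds.
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)
    return folds

# Toy labels: 80 samples of class 0 and 20 samples of class 1.
labels = [0] * 80 + [1] * 20
folds = stratified_folds(labels)

for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    # Resample train_idx here (over- or undersampling), then fit the model.
    # test_idx is left untouched and keeps the original 4:1 class ratio.
    print(Counter(labels[i] for i in train_idx), Counter(labels[i] for i in test_idx))
```

Evaluating on untouched folds gives an estimate of how the model would behave on data with the real class distribution.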

### Use the right evaluation metrics:

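Accuracy is not very useful for classification problems with class imbalance, because the null accuracy (the accuracy obtained by always predicting the majority class) is already extremely high. In the credit card dataset above, always predicting class 0 would give an accuracy of about 99.8%. More informative evaluation metrics include: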

- Confusion Matrix: a table showing correct predictions and types of incorrect predictions.

- Precision: the number of true positives divided by all positive predictions. Precision is also called Positive Predictive Value. It is a measure of a classifier’s exactness. Low precision indicates a high number of false positives.

- Recall: the number of true positives divided by the number of positive values in the test data. Recall is also called Sensitivity or the True Positive Rate. It is a measure of a classifier’s completeness. Low recall indicates a high number of false negatives.

- F1-Score: the harmonic mean of precision and recall, 2 × (precision × recall) / (precision + recall).
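These metrics can be computed directly from the four cells of the confusion matrix. The sketch below uses made-up labels (95 negatives, 5 positives) and a hypothetical helper `binary_metrics`, with F1 computed as the harmonic mean of precision and recall. It compares a model that always predicts the majority class with one that finds 4 of the 5 positives at the cost of 2 false positives.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up ground truth: 95 negatives followed by 5 positives.
y_true = [0] * 95 + [1] * 5

# Baseline that always predicts the majority class (null accuracy).
baseline = binary_metrics(y_true, [0] * 100)

# A model with 2 false positives that finds 4 of the 5 positives.
y_pred = [0] * 93 + [1] * 2 + [1] * 4 + [0] * 1
model = binary_metrics(y_true, y_pred)

print(baseline)
print(model)
```

The baseline reaches 95% accuracy while its recall and F1 are zero, which is the reason accuracy alone is misleading on imbalanced data.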