Data is imbalanced when the classes are not represented equally, i.e. when the number of rows (data points) in one class is much larger than in the other classes.

Most ML algorithms are designed to maximize overall accuracy (minimize overall error) and do not take the class distribution into account. As a result, standard algorithms such as decision trees and logistic regression tend to favor the majority class and neglect the minority class. A model with 95% accuracy is therefore not necessarily a good model: if 95% of the test data belongs to the majority class, the 5% of wrongly predicted rows may all come from the minority class.
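To make the 95% figure concrete, here is a minimal sketch using a made-up test set of 95 majority rows and 5 minority rows (these counts are an assumption for illustration, not data from this notebook). A "model" that always predicts the majority class reaches 95% accuracy while finding none of the minority rows.

```python
# Hypothetical test set: 95 majority-class (0) rows and 5 minority-class (1) rows.
y_true = [0] * 95 + [1] * 5

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: share of rows predicted correctly.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall for the minority class: detected minority rows / all minority rows.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(t == 1 for t in y_true)

print(f"accuracy={accuracy:.2f}, minority recall={recall:.2f}")
```

This is why recall and precision are reported alongside accuracy in the rest of the notebook.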

#### Imbalanced Data Handling Techniques 

Resampling is a common way to handle imbalanced data.

There are two types of sampling:
   - Under Sampling 
       - samples are removed from the majority class
   - Over Sampling 
       - samples are added to the minority class
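As a rough illustration of the two ideas in their simplest (random) form, the sketch below balances a toy dataset with the standard library only. The function names `random_undersample` and `random_oversample` and the toy rows are hypothetical, chosen for this example; they are not part of any library used later.

```python
import random

def random_undersample(majority, minority, seed=0):
    """Drop random majority rows until both classes have the same size."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), list(minority)

def random_oversample(majority, minority, seed=0):
    """Duplicate random minority rows until both classes have the same size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra

# Toy data: 20 majority rows and 4 minority rows.
majority = [("maj", i) for i in range(20)]
minority = [("min", i) for i in range(4)]

under_maj, under_min = random_undersample(majority, minority)
over_maj, over_min = random_oversample(majority, minority)

print(len(under_maj), len(under_min))  # both classes shrink to the minority size
print(len(over_maj), len(over_min))    # both classes grow to the majority size
```

Undersampling throws information away and oversampling by duplication adds none, which is what motivates the more selective techniques covered below.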

#### Load data

In [1]:
import pandas as pd 
import numpy as np

df = pd.read_csv("creditcard.csv")
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [2]:
#check if any values are null
df.isnull().sum()

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64

In [3]:
#check number of rows and columns
df.shape

(284807, 31)

In [4]:
#check descriptive statistics
df.describe()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,...,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0
mean,94813.859575,3.918649e-15,5.682686e-16,-8.761736e-15,2.811118e-15,-1.552103e-15,2.04013e-15,-1.698953e-15,-1.893285e-16,-3.14764e-15,...,1.47312e-16,8.042109e-16,5.282512e-16,4.456271e-15,1.426896e-15,1.70164e-15,-3.662252e-16,-1.217809e-16,88.349619,0.001727
std,47488.145955,1.958696,1.651309,1.516255,1.415869,1.380247,1.332271,1.237094,1.194353,1.098632,...,0.734524,0.7257016,0.6244603,0.6056471,0.5212781,0.482227,0.4036325,0.3300833,250.120109,0.041527
min,0.0,-56.40751,-72.71573,-48.32559,-5.683171,-113.7433,-26.16051,-43.55724,-73.21672,-13.43407,...,-34.83038,-10.93314,-44.80774,-2.836627,-10.2954,-2.604551,-22.56568,-15.43008,0.0,0.0
25%,54201.5,-0.9203734,-0.5985499,-0.8903648,-0.8486401,-0.6915971,-0.7682956,-0.5540759,-0.2086297,-0.6430976,...,-0.2283949,-0.5423504,-0.1618463,-0.3545861,-0.3171451,-0.3269839,-0.07083953,-0.05295979,5.6,0.0
50%,84692.0,0.0181088,0.06548556,0.1798463,-0.01984653,-0.05433583,-0.2741871,0.04010308,0.02235804,-0.05142873,...,-0.02945017,0.006781943,-0.01119293,0.04097606,0.0165935,-0.05213911,0.001342146,0.01124383,22.0,0.0
75%,139320.5,1.315642,0.8037239,1.027196,0.7433413,0.6119264,0.3985649,0.5704361,0.3273459,0.597139,...,0.1863772,0.5285536,0.1476421,0.4395266,0.3507156,0.2409522,0.09104512,0.07827995,77.165,0.0
max,172792.0,2.45493,22.05773,9.382558,16.87534,34.80167,73.30163,120.5895,20.00721,15.59499,...,27.20284,10.50309,22.52841,4.584549,7.519589,3.517346,31.6122,33.84781,25691.16,1.0


In [5]:
#normalize the amount column
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
#df['normalisedamount']=sc.fit_transform(df['Amount'])
#ValueError: Expected 2D array, got 1D array instead:
df['normalisedamount']=sc.fit_transform(np.array(df.Amount).reshape(-1,1))

In [6]:
df.Amount.shape

(284807,)

In [7]:
np.array(df.Amount)

array([149.62,   2.69, 378.66, ...,  67.88,  10.  , 217.  ])

In [8]:
np.array(df.Amount).reshape(-1,1)

array([[149.62],
       [  2.69],
       [378.66],
       ...,
       [ 67.88],
       [ 10.  ],
       [217.  ]])

In [9]:
df['normalisedamount'].head()

0    0.244964
1   -0.342475
2    1.160686
3    0.140534
4   -0.073403
Name: normalisedamount, dtype: float64

In [10]:
#drop columns not needed - Time, Amount
df=df.drop(['Time','Amount'],axis = 1)
df.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Class,normalisedamount
0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,0,0.244964
1,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,0,-0.342475
2,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,0,1.160686
3,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,0,0.140534
4,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,0,-0.073403


In [11]:
#check count of target variable - Class
df.Class.unique()

array([0, 1], dtype=int64)

In [12]:
#there are 492 fraud transactions
df.Class.value_counts()


0    284315
1       492
Name: Class, dtype: int64

In [13]:
#make a copy of the original data
import copy
new_df = copy.deepcopy(df)
new_df.shape

(284807, 30)

In [14]:
#separate class 0 and 1
updated_df0 = new_df[new_df['Class']==0]
updated_df1 = new_df[new_df['Class']==1]
print(updated_df0.shape,updated_df1.shape)

(284315, 30) (492, 30)


In [15]:
#take a fraction of class 0, as it's too big a dataset
#updated_df0.sample(n=2000)
updated_df0= updated_df0.sample(frac=0.125)
updated_df0.shape

(35539, 30)

In [16]:
#combine data 
updated_df = pd.concat([updated_df0, updated_df1])
updated_df['Class'].value_counts()

0    35539
1      492
Name: Class, dtype: int64

In [17]:
#separate independent and target variables
X = updated_df.drop(['Class'],axis=1)
X.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,normalisedamount
194393,2.113735,-0.132777,-1.372156,0.262048,0.108523,-0.842061,0.102106,-0.235471,0.735209,0.103488,...,-0.267495,-0.346885,-0.904432,0.267954,-0.681246,-0.238148,0.242214,-0.077608,-0.06933,-0.353229
114530,-1.170082,0.291851,1.953618,0.09736,0.64009,-0.819238,1.126022,-0.208681,-0.825598,-0.404722,...,0.264939,0.181942,0.31074,-0.14656,0.52559,0.659111,-0.463744,-0.182442,-0.15306,0.009077
93466,-0.620416,0.56977,1.408236,1.130662,0.802418,-0.543023,0.386159,-0.012025,-0.784483,-0.180098,...,0.18327,0.220458,0.607584,-0.085641,0.109243,-0.161772,-0.229601,0.16263,0.155644,-0.325243
225997,2.147257,-1.204425,-1.799627,-1.933752,1.306594,3.6478,-1.598668,0.974291,-0.080944,0.727025,...,0.044511,-0.046234,-0.227332,0.42715,0.67577,-0.458835,-0.464307,0.032637,-0.04666,-0.315687
40213,-1.95993,-0.584519,0.225508,0.119983,-0.411616,-0.334624,2.33278,-0.713595,-0.852094,-0.420824,...,-0.165368,-0.151391,-0.058894,-0.307782,0.042762,0.50854,1.228023,-0.085464,0.218639,1.513637


In [18]:
y=updated_df['Class']
y.head()

194393    0
114530    0
93466     0
225997    0
40213     0
Name: Class, dtype: int64

#### Split the data into train and test

In [19]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42)
print((X_train.shape,y_train.shape),(X_test.shape,y_test.shape))

((28824, 29), (28824,)) ((7207, 29), (7207,))


#### Fit a Logistic regression 

In [20]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train,y_train)
y_pred = lr.predict(X_test)
y_pred


array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [21]:
from sklearn.metrics import accuracy_score, recall_score, f1_score, confusion_matrix, precision_score

acc = accuracy_score(y_pred,y_test)

rec = recall_score(y_test,y_pred)

prec = precision_score(y_test,y_pred)

print("Accuracy={}, Recall_score={}, Precision={}".format(acc,rec,prec))

Accuracy=0.9969474122381018, Recall_score=0.8108108108108109, Precision=0.989010989010989


The accuracy is 99.7% and the precision is 98.9%, but the recall is only 81.1%, which is relatively low: about 19% of the fraud cases in the test set are missed. High accuracy alongside lower recall is typical of an imbalanced dataset.
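The scores printed above can be reproduced from confusion-matrix counts. The counts in this sketch are inferred from the printed values and the split sizes (7207 test rows, 492 - 381 = 111 fraud rows in the test set); they are an assumption consistent with those numbers, not an output of the notebook.

```python
# Confusion-matrix counts inferred from the printed scores (an assumption,
# not an output of the notebook): 7207 test rows, 111 of them fraud.
tp, fn, fp = 90, 21, 1
tn = 7207 - tp - fn - fp

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)          # share of fraud rows that were detected
precision = tp / (tp + fp)       # share of fraud predictions that were right

# Accuracy of a trivial model that labels every row as class 0.
baseline_accuracy = (tn + fp) / (tp + tn + fp + fn)

print(f"accuracy={accuracy:.4f} recall={recall:.4f} precision={precision:.4f}")
print(f"always-0 baseline accuracy={baseline_accuracy:.4f}")
```

Under these counts, a model that never predicts fraud would still score about 98.5% accuracy, so accuracy alone says little about how well the minority class is handled.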

#### 1)Handling Imbalanced Data using Under Sampling

The idea behind sampling is to create new samples or choose some records from the whole data set.

Undersampling removes records from the majority class to balance it against the minority class.
This evens out a skewed class distribution. The most basic technique, referred to as "random undersampling", removes examples from the majority class at random. Although this is simple and sometimes effective, there is a risk of losing useful or important information that could determine the decision boundary between the classes.

A more heuristic approach is therefore needed, one that decides which examples to keep and which redundant examples to delete. A few such undersampling techniques are described below.

##### Methods that Select Examples to Keep - techniques that choose which examples from the majority class to keep

##### A)Near Miss Undersampling 

This technique selects majority class examples based on their distance to minority class examples.
It has three versions, each of which considers different neighbors.

- Version 1 keeps examples with a minimum average distance to the nearest records of the minority class.
- Version 2 selects rows with a minimum average distance to the furthest records of the minority class.
- Version 3 keeps examples from the majority class for each closest record in the minority class.

Among these, version 3 is often considered the most accurate, since it keeps the majority class examples that lie on the decision boundary.

- The type of near-miss strategy used is defined by the “version” argument, which by default is set to 1 for NearMiss-1, but can be set to 2 or 3 for the other two methods.

- By default, the technique will undersample the majority class to have the same number of examples as the minority class, although this can be changed by setting the sampling_strategy argument to a fraction of the minority class.

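- With n_neighbors=3, the average distance is computed over three minority class examples, so the majority class examples kept are those with the smallest average distance to their three nearest minority class examples (for NearMiss-1).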
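The NearMiss-1 rule described above can be sketched in a few lines of plain Python. This is an illustrative simplification, not imbalanced-learn's implementation; the function name `near_miss_1` and the toy points are made up for the example.

```python
import math

def near_miss_1(majority, minority, n_neighbors=3):
    """Keep the len(minority) majority points whose mean distance to their
    n_neighbors closest minority points is smallest."""
    def mean_dist_to_nearest(point):
        dists = sorted(math.dist(point, m) for m in minority)
        return sum(dists[:n_neighbors]) / n_neighbors

    ranked = sorted(majority, key=mean_dist_to_nearest)
    return ranked[:len(minority)]

# Toy data: three minority points near the origin, six majority points.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
majority = [(0.5, 0.5), (1.0, 1.0), (5.0, 5.0), (6.0, 5.0), (2.0, 0.0), (9.0, 9.0)]

kept = near_miss_1(majority, minority)
print(kept)
```

Only the three majority points closest to the minority cluster survive, which shows why NearMiss tends to keep examples near the class boundary and discard the rest.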

In [22]:
from imblearn.under_sampling import NearMiss

In [23]:
undersampling = NearMiss(version=1,n_neighbors=3)
undersampling

In [24]:
print(X_train.shape,y_train.shape)

(28824, 29) (28824,)


In [25]:
X_train_miss,y_train_miss = undersampling.fit_resample(X_train,y_train)

In [26]:
print(X_train_miss.shape,y_train_miss.shape)

(762, 29) (762,)


In [27]:
#after undersampling, counts of label '1'
sum(y_train_miss==1)

381

In [28]:
#after undersampling, counts of label '0'
sum(y_train_miss==0)

381

We have undersampled the majority class — 0 and balanced it out with the minority class — 1.

In [29]:
lr2 = LogisticRegression(max_iter=1000)
lr2.fit(X_train_miss, y_train_miss)

In [30]:
y_pred2 = lr2.predict(X_test)

In [31]:
from sklearn.metrics import accuracy_score, recall_score, f1_score, confusion_matrix, precision_score

acc2 = accuracy_score(y_pred2,y_test)

rec2 = recall_score(y_test,y_pred2)

prec2 = precision_score(y_test,y_pred2)
print("Accuracy={}, Recall_score={}, Precision={}".format(acc2,rec2,prec2))

Accuracy=0.6872485083946164, Recall_score=0.9819819819819819, Precision=0.04616687844133842


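Recall is much higher (98.2%), but accuracy (68.7%) and especially precision (4.6%) drop sharply: most transactions flagged as fraud are false alarms. This technique is appropriate only when detecting the minority class is the overriding priority.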

##### B)Condensed Nearest Neighbor (CNN) Undersampling

- This technique seeks a subset of the samples that results in no loss in model performance, referred to as a minimal consistent set.

- It works by iterating over the dataset and adding an example to the "store" only if it cannot be classified correctly by the current contents of the store. This approach was proposed to reduce the memory requirements of the k-Nearest Neighbors (KNN) algorithm.

- When CNN is used for undersampling, the store starts with all examples from the minority class, and examples from the majority class are added incrementally only if they cannot be classified correctly.

- During the procedure, the KNN algorithm is used to classify points to determine if they are to be added to the store or not. The k value is set via the n_neighbors argument and defaults to 1.

- It’s a relatively slow procedure, so small datasets and small k values are preferred. 

- CNN removes redundant instances.
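The store-building procedure above can be sketched as follows. This is a simplified single pass with k=1 that uses the first majority point as the seed (both are assumptions made for a deterministic example); the full rule repeats passes until the store stops changing, and library implementations choose the seed at random.

```python
import math

def nearest_label(point, store):
    """1-NN prediction: label of the closest stored (point, label) pair."""
    return min(store, key=lambda item: math.dist(point, item[0]))[1]

def condensed_nn(majority, minority):
    """One pass of the CNN rule with k=1.

    The store starts with every minority point plus one majority seed; a
    remaining majority point is added only if the store misclassifies it.
    """
    store = [(p, 1) for p in minority] + [(majority[0], 0)]
    for point in majority[1:]:
        if nearest_label(point, store) != 0:
            store.append((point, 0))
    return [p for p, label in store if label == 0]

# Toy data: a tight majority cluster far away, plus one majority point
# that sits close to the minority points.
minority = [(0.0, 0.0), (0.2, 0.1)]
majority = [(5.0, 5.0), (5.1, 4.9), (4.8, 5.2), (0.6, 0.4), (6.0, 6.0)]

kept_majority = condensed_nn(majority, minority)
print(kept_majority)
```

The redundant points inside the majority cluster are dropped because the seed already classifies them correctly, while the point near the minority class is kept because it would otherwise be misclassified.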

In [32]:
from imblearn.under_sampling import CondensedNearestNeighbour

undersample2 = CondensedNearestNeighbour(n_neighbors=1)
undersample2

In [33]:
print(X_train.shape,y_train.shape)
#before undersampling, counts of label '1'
print(sum(y_train==1))

#before undersampling, counts of label '0'
print(sum(y_train==0))

X_train_cnn,y_train_cnn = undersample2.fit_resample(X_train,y_train)
print(X_train_cnn.shape,y_train_cnn.shape)

#after undersampling, counts of label '1'
print(sum(y_train_cnn==1))

#after undersampling, counts of label '0'
print(sum(y_train_cnn==0))

(28824, 29) (28824,)
381
28443
(927, 29) (927,)
381
546


In [34]:
lr3 = LogisticRegression(max_iter=1000)
lr3.fit(X_train_cnn, y_train_cnn)
y_pred3 = lr3.predict(X_test)
from sklearn.metrics import accuracy_score, recall_score, f1_score, confusion_matrix, precision_score

acc3 = accuracy_score(y_pred3,y_test)

rec3 = recall_score(y_test,y_pred3)

prec3 = precision_score(y_test,y_pred3)
print("Accuracy={}, Recall_score={}, Precision={}".format(acc3,rec3,prec3))

Accuracy=0.9962536422922159, Recall_score=0.8738738738738738, Precision=0.8818181818181818


##### Methods that Select Examples to Delete - techniques that select examples from the majority class to delete.

##### A)Tomek Links undersampling 

- A criticism of the Condensed Nearest Neighbor Rule is that examples are selected randomly, especially initially. This allows redundant examples into the store, as well as examples that are internal to the mass of the distribution rather than on the class boundary.

- One solution is to find pairs of examples, one from each class, that together have the smallest Euclidean distance to each other in feature space. In a binary classification problem with classes 0 and 1, such a pair has one example from each class, and the two examples are each other's nearest neighbors across the dataset.

- These cross-class pairs are now generally referred to as “Tomek Links” and are valuable as they define the class boundary.

- The procedure for finding Tomek Links can be used to locate all cross-class nearest neighbors. If the examples in the minority class are held constant, the procedure finds all the examples in the majority class that are closest to the minority class, which are then removed. These are the ambiguous examples.

- Tomek Links can be said to remove borderline and noisy instances.
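The definition above (two examples of opposite classes that are each other's nearest neighbor) can be checked directly with a brute-force search. The sketch below is illustrative only; the helper names and the toy points are hypothetical.

```python
import math

def nearest_index(i, points):
    """Index of the point closest to points[i], excluding itself."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: math.dist(points[i], points[j]))

def tomek_links(points, labels):
    """Return index pairs (i, j) that are mutual nearest neighbours
    and carry different labels."""
    links = []
    for i in range(len(points)):
        j = nearest_index(i, points)
        if i < j and nearest_index(j, points) == i and labels[i] != labels[j]:
            links.append((i, j))
    return links

# Toy data: points 0 and 1 are a cross-class pair; the other pairs share a label.
points = [(0.0, 0.0), (0.1, 0.0), (3.0, 3.0), (3.2, 3.0), (6.0, 0.0), (6.5, 0.0)]
labels = [0, 1, 0, 0, 1, 1]

links = tomek_links(points, labels)
# For undersampling, only the majority-class (label 0) member of each link is dropped.
majority_to_drop = [i if labels[i] == 0 else j for i, j in links]
print(links, majority_to_drop)
```

Because only a handful of pairs usually qualify, Tomek Links removes few examples, which matches the small reduction seen in the notebook output below.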


In [35]:
from imblearn.under_sampling import TomekLinks

undersample3 = TomekLinks()
undersample3

In [36]:
print(X_train.shape,y_train.shape)
#before undersampling, counts of label '1'
print(sum(y_train==1))

#before undersampling, counts of label '0'
print(sum(y_train==0))

X_train_tom,y_train_tom = undersample3.fit_resample(X_train,y_train)
print(X_train_tom.shape,y_train_tom.shape)

#after undersampling, counts of label '1'
print(sum(y_train_tom==1))

#after undersampling, counts of label '0'
print(sum(y_train_tom==0))

(28824, 29) (28824,)
381
28443
(28807, 29) (28807,)
381
28426


In [37]:
lr4 = LogisticRegression(max_iter=1000)
lr4.fit(X_train_tom, y_train_tom)
y_pred4 = lr4.predict(X_test)
from sklearn.metrics import accuracy_score, recall_score, f1_score, confusion_matrix, precision_score

acc4 = accuracy_score(y_pred4,y_test)

rec4 = recall_score(y_test,y_pred4)

prec4 = precision_score(y_test,y_pred4)
print("Accuracy={}, Recall_score={}, Precision={}".format(acc4,rec4,prec4))

Accuracy=0.9972249202164563, Recall_score=0.8288288288288288, Precision=0.989247311827957


##### B)Edited Nearest Neighbors Rule for Undersampling

- Another rule for finding ambiguous and noisy examples in a dataset is called Edited Nearest Neighbors, or sometimes ENN for short.

- This rule involves using k=3 nearest neighbors to locate those examples in a dataset that are misclassified and that are then removed before a k=1 classification rule is applied. 

- When used as an undersampling procedure, the rule can be applied to each example in the majority class, allowing those examples that are misclassified as belonging to the minority class to be removed, and those correctly classified to remain.

- It is also applied to each example in the minority class where those examples that are misclassified have their nearest neighbors from the majority class deleted.

- The n_neighbors argument controls the number of neighbors to use in the editing rule, which defaults to three.

- Edited Nearest Neighbor Rule gives best results when combined with another undersampling method.
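A minimal sketch of the editing rule, applied to the majority class only, is shown below. It assumes a plain majority vote among the k=3 nearest neighbors; imbalanced-learn exposes a `kind_sel` option for how the neighbors must agree, so library results may differ from this simplification. The function name and toy data are hypothetical.

```python
import math
from collections import Counter

def edited_nearest_neighbours(points, labels, majority_label=0, k=3):
    """Return indices to keep: majority points are dropped when most of
    their k nearest neighbours carry a different label."""
    keep = []
    for i, (point, label) in enumerate(zip(points, labels)):
        if label != majority_label:
            keep.append(i)  # minority points are never removed
            continue
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(point, points[j]))
        vote = Counter(labels[j] for j in others[:k]).most_common(1)[0][0]
        if vote == label:
            keep.append(i)
    return keep

# Toy data: index 3 is a majority point sitting inside the minority cluster.
points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (0.1, -0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

keep = edited_nearest_neighbours(points, labels)
print(keep)
```

Only the majority point surrounded by minority examples is removed; the well-separated majority cluster is left untouched.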

In [38]:
from imblearn.under_sampling import EditedNearestNeighbours

undersample4a = EditedNearestNeighbours(n_neighbors=3)
undersample4a

In [39]:
print(X_train.shape,y_train.shape)
#before undersampling, counts of label '1'
print(sum(y_train==1))

#before undersampling, counts of label '0'
print(sum(y_train==0))
X_train_enn,y_train_enn = undersample4a.fit_resample(X_train,y_train)
print(X_train_enn.shape,y_train_enn.shape)

#after undersampling, counts of label '1'
print(sum(y_train_enn==1))

#after undersampling, counts of label '0'
print(sum(y_train_enn==0))

(28824, 29) (28824,)
381
28443
(28704, 29) (28704,)
381
28323


In [40]:
lr4a = LogisticRegression(max_iter=1000)
lr4a.fit(X_train_enn, y_train_enn)
y_pred4a = lr4a.predict(X_test)
from sklearn.metrics import accuracy_score, recall_score, f1_score, confusion_matrix, precision_score

acc4a = accuracy_score(y_pred4a,y_test)

rec4a = recall_score(y_test,y_pred4a)

prec4a = precision_score(y_test,y_pred4a)
print("Accuracy={}, Recall_score={}, Precision={}".format(acc4a,rec4a,prec4a))

Accuracy=0.9975024281948106, Recall_score=0.8558558558558559, Precision=0.979381443298969


##### Combinations of Keep and Delete Methods - methods that combine the techniques to both keep and delete examples from the majority class

##### A)One-Sided Selection for Undersampling

- One-Sided Selection, or OSS for short, is an undersampling technique that combines Tomek Links and the Condensed Nearest Neighbor (CNN) Rule.

- The CNN procedure occurs in one-step and involves first adding all minority class examples to the store and some number of majority class examples (e.g. 1), then classifying all remaining majority class examples with KNN (k=1) and adding those that are misclassified to the store.

- The number of seed examples can be set with n_seeds_S and defaults to 1 and the k for KNN can be set via the n_neighbors argument and defaults to 1.

- Given that the CNN procedure occurs in one block, it is more useful to have a larger seed sample of the majority class in order to effectively remove redundant examples. In this case, we will use a value of 200.

- We might expect a large number of redundant examples from the majority class to be removed from the interior of the distribution (e.g. away from the class boundary).

- It might be interesting to explore larger seed samples from the majority class and different values of k used in the one-step CNN procedure.

In [41]:
from imblearn.under_sampling import OneSidedSelection

undersample4b = OneSidedSelection(n_neighbors=1, n_seeds_S=200)
undersample4b

In [42]:
print(X_train.shape,y_train.shape)
#before undersampling, counts of label '1'
print(sum(y_train==1))

#before undersampling, counts of label '0'
print(sum(y_train==0))

X_train_oss,y_train_oss = undersample4b.fit_resample(X_train,y_train)
print(X_train_oss.shape,y_train_oss.shape)

#after undersampling, counts of label '1'
print(sum(y_train_oss==1))

#after undersampling, counts of label '0'
print(sum(y_train_oss==0))

(28824, 29) (28824,)
381
28443
(3584, 29) (3584,)
381
3203


In [43]:
lr4b = LogisticRegression(max_iter=1000)
lr4b.fit(X_train_oss, y_train_oss)
y_pred4b = lr4b.predict(X_test)
from sklearn.metrics import accuracy_score, recall_score, f1_score, confusion_matrix, precision_score

acc4b = accuracy_score(y_pred4b,y_test)

rec4b = recall_score(y_test,y_pred4b)

prec4b = precision_score(y_test,y_pred4b)
print("Accuracy={}, Recall_score={}, Precision={}".format(acc4b,rec4b,prec4b))

Accuracy=0.9973636742056334, Recall_score=0.8288288288288288, Precision=1.0


##### B)Neighborhood Cleaning Undersampling

- Neighborhood Cleaning Rule, or NCR for short, is an undersampling technique that combines both the Condensed Nearest Neighbor (CNN) Rule to remove redundant examples and the Edited Nearest Neighbors (ENN) Rule to remove noisy or ambiguous examples.

- The approach involves first selecting all examples from the minority class. Then all of the ambiguous examples in the majority class are identified using the ENN rule and removed. Finally, a one-step version of CNN is used where those remaining examples in the majority class that are misclassified against the store are removed, but only if the number of examples in the majority class is larger than half the size of the minority class.

- The number of neighbors used in the ENN and CNN steps can be specified via the n_neighbors argument that defaults to three. The threshold_cleaning controls whether or not the CNN is applied to a given class, which might be useful if there are multiple minority classes with similar sizes. This is kept at 0.5.

In [44]:
from imblearn.under_sampling import NeighbourhoodCleaningRule

undersample4 = NeighbourhoodCleaningRule()
undersample4

In [45]:
print(X_train.shape,y_train.shape)
#before undersampling, counts of label '1'
print(sum(y_train==1))

#before undersampling, counts of label '0'
print(sum(y_train==0))

X_train_ncr,y_train_ncr = undersample4.fit_resample(X_train,y_train)
print(X_train_ncr.shape,y_train_ncr.shape)

#after undersampling, counts of label '1'
print(sum(y_train_ncr==1))

#after undersampling, counts of label '0'
print(sum(y_train_ncr==0))

(28824, 29) (28824,)
381
28443
(28625, 29) (28625,)
381
28244


In [46]:
lr5 = LogisticRegression(max_iter=1000)
lr5.fit(X_train_ncr, y_train_ncr)
y_pred5 = lr5.predict(X_test)
from sklearn.metrics import accuracy_score, recall_score, f1_score, confusion_matrix, precision_score

acc5 = accuracy_score(y_pred5,y_test)

rec5 = recall_score(y_test,y_pred5)

prec5 = precision_score(y_test,y_pred5)
print("Accuracy={}, Recall_score={}, Precision={}".format(acc5,rec5,prec5))

Accuracy=0.9975024281948106, Recall_score=0.8558558558558559, Precision=0.979381443298969


#### 2)Handling Imbalanced Data using Over Sampling

- Oversampling focuses on increasing minority class samples.

- We can also duplicate the examples to increase the minority class samples. Although it balances the data, it does not provide additional information to the classification model.

- Therefore synthesizing new examples using an appropriate technique is necessary. 

##### SMOTE 

- An improvement on duplicating examples from the minority class is to synthesize new examples from the minority class. This is a type of data augmentation for tabular data and can be very effective. Perhaps the most widely used approach to synthesizing new examples is called the Synthetic Minority Oversampling TEchnique, or SMOTE for short.

- SMOTE works by selecting examples that are close in the feature space, drawing a line between the examples in the feature space and drawing a new sample at a point along that line.

- Specifically, a random example from the minority class is first chosen. Then k of the nearest neighbors for that example are found (typically k=5). A randomly selected neighbor is chosen and a synthetic example is created at a randomly selected point between the two examples in feature space. This procedure can be used to create as many synthetic examples for the minority class as are required.

- The approach is effective because new synthetic examples from the minority class are created that are plausible, that is, are relatively close in feature space to existing examples from the minority class.

- A general downside of the approach is that synthetic examples are created without considering the majority class, possibly resulting in ambiguous examples if there is a strong overlap for the classes.
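The interpolation step described above can be sketched with the standard library alone. This is a simplified illustration rather than imbalanced-learn's implementation; the function name `smote`, the seed and the toy points are assumptions made for the example.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a random
    minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Toy minority class in two dimensions.
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 1.0), (1.2, 2.0), (1.8, 1.9), (1.4, 1.5)]

new_points = smote(minority, n_new=4, k=3)
for point in new_points:
    print(point)
```

Every synthetic point lies on a segment between two existing minority points, so it stays inside the region the minority class already occupies. The majority class is never consulted, which is the downside noted in the last bullet.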

In [47]:
from imblearn.over_sampling import SMOTE

oversample1 = SMOTE()
oversample1

In [48]:
print(X_train.shape,y_train.shape)
#before oversampling, counts of label '1'
print(sum(y_train==1))

#before oversampling, counts of label '0'
print(sum(y_train==0))

#resample with the SMOTE object created above (oversample1), not the NCR undersampler
X_train_smo,y_train_smo = oversample1.fit_resample(X_train,y_train)
print(X_train_smo.shape,y_train_smo.shape)

#after oversampling, counts of label '1'
print(sum(y_train_smo==1))

#after oversampling, counts of label '0'
print(sum(y_train_smo==0))

(28824, 29) (28824,)
381
28443
(56886, 29) (56886,)
28443
28443


In [49]:
lr6 = LogisticRegression(max_iter=1000)
lr6.fit(X_train_smo, y_train_smo)
y_pred6 = lr6.predict(X_test)
from sklearn.metrics import accuracy_score, recall_score, f1_score, confusion_matrix, precision_score

acc6 = accuracy_score(y_pred6,y_test)

rec6 = recall_score(y_test,y_pred6)

prec6 = precision_score(y_test,y_pred6)
print("Accuracy={}, Recall_score={}, Precision={}".format(acc6,rec6,prec6))

The scores originally recorded for this cell (Accuracy=0.9975024281948106, Recall_score=0.8558558558558559, Precision=0.979381443298969) came from a run that resampled with `undersample4` (the Neighbourhood Cleaning Rule) instead of `oversample1`, which is why they are identical to the NCR scores. Re-run the cell to obtain the scores for SMOTE.


##### Borderline-SMOTE

- A popular extension to SMOTE involves selecting those instances of the minority class that are misclassified, such as with a k-nearest neighbor classification model.

- We can then oversample just those difficult instances, providing more resolution only where it may be required.

- These misclassified examples are likely ambiguous and lie near the edge or border of the decision boundary, where class membership may overlap. This modification of SMOTE is therefore called Borderline-SMOTE.

- The original method comes in two variants. Borderline-SMOTE1 creates synthetic examples by interpolating between each borderline minority example and its nearest minority class neighbors. Borderline-SMOTE2 additionally interpolates toward the nearest majority class neighbors, placing the new point closer to the minority example. In imbalanced-learn the variant is chosen with the `kind` argument (`'borderline-1'` by default, or `'borderline-2'`).

- Instead of generating new synthetic examples for the minority class blindly, we would expect the Borderline-SMOTE method to only create synthetic examples along the decision boundary between the two classes.
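The selection of borderline instances can be sketched as below. The thresholds follow the rule usually given for Borderline-SMOTE: with m nearest neighbors, a minority point is "noise" if all m are majority, "danger" (borderline) if at least half are, and "safe" otherwise. The function name, m=3 and the toy data are assumptions for illustration.

```python
import math

def categorise_minority(minority, majority, m=5):
    """Label each minority point 'safe', 'danger' or 'noise' from the number
    of majority points among its m nearest neighbours."""
    labelled = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    result = {}
    for point in minority:
        neighbours = sorted((item for item in labelled if item[0] is not point),
                            key=lambda item: math.dist(point, item[0]))[:m]
        n_majority = sum(1 for _, label in neighbours if label == 0)
        if n_majority == m:
            result[point] = "noise"      # surrounded by the majority class
        elif n_majority >= m / 2:
            result[point] = "danger"     # near the border: oversample these
        else:
            result[point] = "safe"       # deep inside the minority class
    return result

# Toy data: a minority cluster, two minority points at the border,
# and one isolated minority point inside the majority class.
minority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
            (2.0, 2.0), (2.1, 2.0), (5.0, 5.0)]
majority = [(2.2, 2.2), (2.3, 2.1), (2.0, 2.3),
            (5.1, 5.0), (5.0, 5.1), (4.9, 5.0), (5.0, 4.9)]

categories = categorise_minority(minority, majority, m=3)
for point, category in categories.items():
    print(point, category)
```

Only the "danger" points would be used as starting points for SMOTE-style interpolation, which concentrates the synthetic examples along the class boundary.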

In [50]:
from imblearn.over_sampling import BorderlineSMOTE

oversample2 = BorderlineSMOTE()
oversample2

In [51]:
print(X_train.shape,y_train.shape)
#before oversampling, counts of label '1'
print(sum(y_train==1))

#before oversampling, counts of label '0'
print(sum(y_train==0))

X_train_bls,y_train_bls = oversample2.fit_resample(X_train,y_train)
print(X_train_bls.shape,y_train_bls.shape)

#after oversampling, counts of label '1'
print(sum(y_train_bls==1))

#after oversampling, counts of label '0'
print(sum(y_train_bls==0))

(28824, 29) (28824,)
381
28443
(56886, 29) (56886,)
28443
28443


In [52]:
lr7 = LogisticRegression(max_iter=1000)
lr7.fit(X_train_bls,y_train_bls)
y_pred7 = lr7.predict(X_test)
from sklearn.metrics import accuracy_score, recall_score, f1_score, confusion_matrix, precision_score

acc7 = accuracy_score(y_pred7,y_test)

rec7 = recall_score(y_test,y_pred7)

prec7 = precision_score(y_test,y_pred7)
print("Accuracy={}, Recall_score={}, Precision={}".format(acc7,rec7,prec7))

Accuracy=0.9697516303593728, Recall_score=0.8918918918918919, Precision=0.32459016393442625


##### Borderline-SMOTE SVM

- This is an alternative to Borderline-SMOTE in which an SVM algorithm is used instead of KNN to identify misclassified examples on the decision boundary. The SVM locates the decision boundary defined by the support vectors, and examples in the minority class that are close to the support vectors become the focus for generating synthetic examples.

- In addition to using an SVM, the technique attempts to select regions where there are fewer examples of the minority class and tries to extrapolate towards the class boundary.

In [53]:
from imblearn.over_sampling import SVMSMOTE

oversample3= SVMSMOTE()
oversample3

In [54]:
print(X_train.shape,y_train.shape)
#before oversampling, counts of label '1'
print(sum(y_train==1))

#before oversampling, counts of label '0'
print(sum(y_train==0))

X_train_svm,y_train_svm = oversample3.fit_resample(X_train,y_train)
print(X_train_svm.shape,y_train_svm.shape)

#after oversampling, counts of label '1'
print(sum(y_train_svm==1))

#after oversampling, counts of label '0'
print(sum(y_train_svm==0))

(28824, 29) (28824,)
381
28443
(56886, 29) (56886,)
28443
28443


In [55]:
lr8 = LogisticRegression(max_iter=1000)
lr8.fit(X_train_svm,y_train_svm)
y_pred8 = lr8.predict(X_test)
from sklearn.metrics import accuracy_score, recall_score, f1_score, confusion_matrix, precision_score

acc8 = accuracy_score(y_pred8,y_test)

rec8 = recall_score(y_test,y_pred8)

prec8 = precision_score(y_test,y_pred8)
print("Accuracy={}, Recall_score={}, Precision={}".format(acc8,rec8,prec8))

Accuracy=0.9832107673095601, Recall_score=0.918918918918919, Precision=0.4766355140186916


##### Adaptive Synthetic Sampling (ADASYN)

- Another approach involves generating synthetic samples inversely proportional to the density of the examples in the minority class. That is, generate more synthetic examples in regions of the feature space where the density of minority examples is low, and fewer or none where the density is high.

- This modification to SMOTE is referred to as the Adaptive Synthetic Sampling Method, or ADASYN. 

- In this technique, examples in the minority class are weighted according to their density, then those examples with the lowest density are the focus for the SMOTE synthetic example generation process.
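One way to read the weighting above: for each minority point, the share of majority points among its k nearest neighbors measures how sparse the minority class is around it, and the synthetic samples are split in proportion to that share. The sketch below covers only this allocation step (not the generation step); the function name, k=3 and the toy data are assumptions for illustration.

```python
import math

def adasyn_allocation(minority, majority, n_new, k=5):
    """Split n_new synthetic samples across minority points in proportion to
    the share of majority points among each point's k nearest neighbours."""
    labelled = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    ratios = []
    for point in minority:
        neighbours = sorted((item for item in labelled if item[0] is not point),
                            key=lambda item: math.dist(point, item[0]))[:k]
        ratios.append(sum(1 for _, label in neighbours if label == 0) / k)
    total = sum(ratios)
    if total == 0:
        return [0] * len(minority)  # no minority point has majority neighbours
    return [round(n_new * r / total) for r in ratios]

# Toy data: a dense minority cluster, two border points, one isolated point.
minority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
            (2.0, 2.0), (2.1, 2.0), (5.0, 5.0)]
majority = [(2.2, 2.2), (2.3, 2.1), (2.0, 2.3),
            (5.1, 5.0), (5.0, 5.1), (4.9, 5.0), (5.0, 4.9)]

allocation = adasyn_allocation(minority, majority, n_new=14, k=3)
print(allocation)
```

Points in the dense minority cluster receive no synthetic samples, while points surrounded by the majority class receive the most. Because the per-point counts are rounded, the total can differ slightly from the requested number, which may be why the ADASYN output below is not perfectly balanced.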

In [56]:
from imblearn.over_sampling import ADASYN

oversample4= ADASYN()
oversample4

In [57]:
print(X_train.shape,y_train.shape)
#before oversampling, counts of label '1'
print(sum(y_train==1))

#before oversampling, counts of label '0'
print(sum(y_train==0))

X_train_ada,y_train_ada = oversample4.fit_resample(X_train,y_train)
print(X_train_ada.shape,y_train_ada.shape)

#after oversampling, counts of label '1'
print(sum(y_train_ada==1))

#after oversampling, counts of label '0'
print(sum(y_train_ada==0))

(28824, 29) (28824,)
381
28443
(56890, 29) (56890,)
28447
28443


In [58]:
lr9 = LogisticRegression(max_iter=1000)
lr9.fit(X_train_ada,y_train_ada)
y_pred9 = lr9.predict(X_test)
from sklearn.metrics import accuracy_score, recall_score, f1_score, confusion_matrix, precision_score

acc9 = accuracy_score(y_pred9,y_test)

rec9 = recall_score(y_test,y_pred9)

prec9 = precision_score(y_test,y_pred9)
print("Accuracy={}, Recall_score={}, Precision={}".format(acc9,rec9,prec9))

Accuracy=0.8844179270154017, Recall_score=0.954954954954955, Precision=0.11349036402569593


#### 3)Handling imbalanced data using a combination of Undersampling and Oversampling

- This approach combines two widely used techniques for imbalanced datasets: SMOTE for oversampling and Tomek Links for undersampling.

- A Tomek link exists when two instances of opposite classes are each other's nearest neighbors. In other words, they are pairs of opposing instances that are very close together.

- In imbalanced-learn's SMOTETomek, SMOTE is applied first to oversample the minority class.

- Tomek links are then identified in the resampled data and used to exclude examples that are too close to the opposite class. With sampling_strategy='majority', only the majority class member of each link is removed.

In [59]:
from imblearn.combine import SMOTETomek

resample = SMOTETomek(tomek=TomekLinks(sampling_strategy='majority'))
resample

In [60]:
print(X_train.shape,y_train.shape)
#before oversampling, counts of label '1'
print(sum(y_train==1))

#before oversampling, counts of label '0'
print(sum(y_train==0))

X_train_com,y_train_com = resample.fit_resample(X_train,y_train)
print(X_train_com.shape,y_train_com.shape)

#after oversampling, counts of label '1'
print(sum(y_train_com==1))

#after oversampling, counts of label '0'
print(sum(y_train_com==0))

(28824, 29) (28824,)
381
28443
(56886, 29) (56886,)
28443
28443


In [61]:
lr10 = LogisticRegression(max_iter=1000)
lr10.fit(X_train_com,y_train_com)
y_pred10 = lr10.predict(X_test)
from sklearn.metrics import accuracy_score, recall_score, f1_score, confusion_matrix, precision_score

acc10 = accuracy_score(y_pred10,y_test)

rec10 = recall_score(y_test,y_pred10)

prec10 = precision_score(y_test,y_pred10)
print("Accuracy={}, Recall_score={}, Precision={}".format(acc10,rec10,prec10))

Accuracy=0.9707229082836132, Recall_score=0.9369369369369369, Precision=0.33766233766233766


#### Conclusion

Which method is best, undersampling or oversampling? The answer depends on the problem: how much data we are working with, how the data is distributed, and in particular how much we can tolerate false positives versus false negatives.

Oversampling methods duplicate examples or create new synthetic examples in the minority class, whereas undersampling methods delete or merge examples in the majority class. Both types of resampling can be effective in isolation, although they can be more effective when used together. It is also generally advisable to deal with outliers before applying any of the techniques described above, since neighbor-based methods are sensitive to them.
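Since the right trade-off between false positives and false negatives depends on the problem, a single number such as the F1 score (the harmonic mean of precision and recall) can help compare the runs. The sketch below recomputes F1 from the recall and precision values printed in the cells above; the choice of F1 as the yardstick is an assumption, and the plain SMOTE run is left out.

```python
# (recall, precision) pairs copied from the printed outputs above.
results = {
    "No resampling": (0.8108108108108109, 0.989010989010989),
    "NearMiss-1": (0.9819819819819819, 0.04616687844133842),
    "CNN": (0.8738738738738738, 0.8818181818181818),
    "Tomek Links": (0.8288288288288288, 0.989247311827957),
    "ENN": (0.8558558558558559, 0.979381443298969),
    "OSS": (0.8288288288288288, 1.0),
    "NCR": (0.8558558558558559, 0.979381443298969),
    "Borderline-SMOTE": (0.8918918918918919, 0.32459016393442625),
    "SVMSMOTE": (0.918918918918919, 0.4766355140186916),
    "ADASYN": (0.954954954954955, 0.11349036402569593),
    "SMOTETomek": (0.9369369369369369, 0.33766233766233766),
}

def f1(recall, precision):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

scores = {name: f1(r, p) for name, (r, p) in results.items()}

# Print the methods from highest to lowest F1.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    recall, precision = results[name]
    print(f"{name:<18} recall={recall:.3f} precision={precision:.3f} f1={score:.3f}")
```

On this particular split, the cleaning-style undersamplers keep precision high while raising recall modestly, whereas the oversamplers and NearMiss buy higher recall at a large cost in precision. A different dataset, model or decision threshold could change this ranking.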