# Oversampling Techniques for Imbalanced Datasets

This notebook compares three **oversampling techniques** for imbalanced classification, applied to a credit card fraud dataset. Each technique balances the class distribution by adding samples to the minority class.

## 1. **Synthetic Minority Oversampling Technique (SMOTE)**
SMOTE generates synthetic minority-class samples by interpolating between an existing minority sample and one of its nearest minority-class neighbors. Because the new points are not exact copies, a model is less prone to memorizing individual minority samples than it is with plain duplication.
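
The interpolation step can be sketched in a few lines. This is a minimal pure-Python illustration under simplifying assumptions, not the imbalanced-learn implementation; the function name `smote_like_samples`, the toy points, and the choice of `k=2` are all hypothetical.

```python
import math
import random

def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    minority point and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbors of `base` among the other minority points
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Hypothetical two-dimensional minority samples
minority_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.2), (3.0, 3.0)]
new_points = smote_like_samples(minority_points, n_new=5)
print(new_points)
```

Every synthetic point is a convex combination of two real minority points, so it always falls inside the region the minority class already occupies.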

## 2. **Adaptive Synthetic Sampling (ADASYN)**
ADASYN extends SMOTE by generating more synthetic samples for the minority points that are harder to learn, meaning those whose nearest neighbors are mostly majority-class samples. This concentrates the new samples near the boundary between the classes rather than spreading them evenly.
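
The weighting idea can be sketched as follows. This is a simplified pure-Python illustration of how the synthetic-sample budget might be split across minority points, not the imbalanced-learn code; `adasyn_allocation`, the toy points, and `k=3` are hypothetical, and the interpolation step itself is omitted.

```python
import math

def adasyn_allocation(minority, majority, n_new, k=3):
    """Return how many synthetic samples each minority point would receive,
    in proportion to the share of majority points among its k nearest neighbors."""
    labelled = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    ratios = []
    for point in minority:
        # k nearest neighbors of `point` in the full dataset, excluding itself
        others = [(q, y) for q, y in labelled if q is not point]
        nearest = sorted(others, key=lambda item: math.dist(point, item[0]))[:k]
        ratios.append(sum(1 for _, y in nearest if y == 0) / k)
    total = sum(ratios)
    if total == 0:
        return [0] * len(minority)  # no minority point has majority neighbors
    # Rounding means the allocations may not sum exactly to n_new in general.
    return [round(n_new * r / total) for r in ratios]

# Hypothetical data: two minority points far from the majority, one surrounded by it
minority_points = [(0.0, 0.0), (0.2, 0.1), (2.0, 2.0)]
majority_points = [(2.1, 2.0), (2.0, 2.2), (1.9, 1.8), (5.0, 5.0)]
allocation = adasyn_allocation(minority_points, majority_points, n_new=10)
print(allocation)
```

The minority point that sits among majority samples receives the largest share of the budget, which is the behavior that distinguishes ADASYN from SMOTE's uniform treatment.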

## 3. **Random Oversampling (ROS)**
ROS is the simplest of the three: minority-class samples are duplicated at random (sampled with replacement) until the classes are balanced. It is easy to implement, but because it adds exact copies, a model can overfit to the repeated samples.
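
A minimal sketch of the duplication step, using only the standard library. The helper `random_oversample` and the toy data are hypothetical; the notebook itself relies on imbalanced-learn's `RandomOverSampler`.

```python
import random
from collections import Counter

def random_oversample(features, labels, seed=0):
    """Duplicate samples of the smaller classes, drawn with replacement,
    until every class has as many samples as the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(features), list(labels)
    for label, count in counts.items():
        pool = [x for x, y in zip(features, labels) if y == label]
        extra = rng.choices(pool, k=target - count)  # empty for the largest class
        out_x.extend(extra)
        out_y.extend([label] * len(extra))
    return out_x, out_y

# Hypothetical data: six majority samples (label 0) and two minority samples (label 1)
toy_x = [(0.1,), (0.2,), (0.3,), (0.4,), (0.5,), (0.6,), (5.0,), (5.5,)]
toy_y = [0, 0, 0, 0, 0, 0, 1, 1]
x_bal, y_bal = random_oversample(toy_x, toy_y)
print(Counter(y_bal))
```

No new feature values are created: every added row is an exact copy of an existing minority row.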

---
In the cells below, each technique is applied to the training data only, so the test data keep their original class distribution. Results are reported as accuracy on the legitimate test transactions and on the fraudulent test transactions separately.


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
df=pd.read_csv('/content/drive/MyDrive/Colab Notebooks/archive/creditcard.csv')

In [4]:
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [5]:
fraud = df[df['Class'] == 1]
legit = df[df['Class'] == 0]

print('Fraudulent Transactions:', len(fraud))
print('Legitimate Transactions:', len(legit))

Fraudulent Transactions: 492
Legitimate Transactions: 284315
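
The counts above imply a severe imbalance. A short calculation, using only the two numbers printed above, makes the scale concrete and shows why plain accuracy is a poor yardstick here: a model that labels every transaction as legitimate would already score above 99.8%.

```python
# Class counts copied from the output above
n_fraud = 492
n_legit = 284_315

total = n_fraud + n_legit
fraud_share = n_fraud / total                  # fraction of all transactions that are fraud
imbalance_ratio = n_legit / n_fraud            # legitimate transactions per fraudulent one
majority_baseline_accuracy = n_legit / total   # accuracy of always predicting "legitimate"

print(f"Total transactions:         {total}")
print(f"Fraud share:                {fraud_share:.4%}")
print(f"Legitimate per fraud:       {imbalance_ratio:.1f}")
print(f"Always-legitimate accuracy: {majority_baseline_accuracy:.4%}")
```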


In [6]:
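from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# The features are scaled further down with a scaler fit on the training split only.
# Scaling `df` here would leak test-set statistics, and it would not reach `fraud`
# and `legit` anyway, since they were copied out of `df` in the previous cell.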

fraud_train, fraud_test = train_test_split(fraud, test_size=0.3, random_state=42)
legit_train, legit_test = train_test_split(legit, test_size=0.3, random_state=42)

train_data = pd.concat([fraud_train, legit_train])
train_data = train_data.sample(frac=1, random_state=42).reset_index(drop=True)
print('Training Data Shape:', train_data.shape)

print('Fraud Data test Shape: ', fraud_test.shape)
print('Legitimate Data test Shape: ', legit_test.shape)

train_data = train_data.values
fraud_test = fraud_test.values
legit_test = legit_test.values

x_train = train_data[:, 1:30]
y_train = train_data[:, 30]


x_fraud_test = fraud_test[:, 1:30]
y_fraud_test = fraud_test[:, 30]

x_legit_test = legit_test[:, 1:30]
y_legit_test = legit_test[:, 30]

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_fraud_test = scaler.transform(x_fraud_test)
x_legit_test = scaler.transform(x_legit_test)

print('X_train Shape:', x_train.shape)
print('Y_train Shape:', y_train.shape)
print('X_fraud_test Shape:', x_fraud_test.shape)
print('Y_fraud_test Shape:', y_fraud_test.shape)
print('X_legit_test Shape:', x_legit_test.shape)
print('Y_legit_test Shape:', y_legit_test.shape)

Training Data Shape: (199364, 31)
Fraud Data test Shape:  (148, 31)
Legitimate Data test Shape:  (85295, 31)
X_train Shape: (199364, 29)
Y_train Shape: (199364,)
X_fraud_test Shape: (148, 29)
Y_fraud_test Shape: (148,)
X_legit_test Shape: (85295, 29)
Y_legit_test Shape: (85295,)
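
The printed shapes follow from applying the 70/30 split to each class separately. As a quick arithmetic check, assuming a fractional test size is rounded up to the next whole sample (an assumption that is consistent with the shapes printed above), the counts can be reproduced as follows. `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size=0.3):
    """Return (n_train, n_test), rounding the test share up to a whole sample."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test

fraud_train_n, fraud_test_n = split_sizes(492)
legit_train_n, legit_test_n = split_sizes(284_315)
n_train_total = fraud_train_n + legit_train_n

print("Fraud train/test:", fraud_train_n, fraud_test_n)
print("Legit train/test:", legit_train_n, legit_test_n)
print("Training rows:   ", n_train_total)
print(f"Fraud share of training data: {fraud_train_n / n_train_total:.4%}")
```

Splitting per class keeps the fraud share of the training data close to that of the full dataset, so the training set is still heavily imbalanced before any oversampling is applied.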


In [8]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

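def evaluate_model(model, x_train, y_train, x_legit_test, y_legit_test, x_fraud_test, y_fraud_test):
    """
    Fits a model and scores it separately on the legitimate and fraud test sets.

    Each test set contains a single class, so accuracy on the legitimate set is
    the true negative rate (specificity) and accuracy on the fraud set is the
    true positive rate (recall).

    Args:
        model: The machine learning model to evaluate.
        x_train, y_train: Training data and labels.
        x_legit_test, y_legit_test: Test data and labels for legitimate transactions.
        x_fraud_test, y_fraud_test: Test data and labels for fraudulent transactions.

    Returns:
        A dictionary with keys:
            - "accuracy_legit": accuracy on the legitimate test set
            - "accuracy_fraud": accuracy on the fraud test set
    """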
    # Train the model
    model.fit(x_train, y_train)

    # Evaluate accuracies
    accuracy_legit = model.score(x_legit_test, y_legit_test)
    accuracy_fraud = model.score(x_fraud_test, y_fraud_test)

    return {
        "accuracy_legit": accuracy_legit,
        "accuracy_fraud": accuracy_fraud
    }


models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=42),
    "SVM": SVC(kernel='rbf', random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=42),
    "GaussianNB": GaussianNB(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    # The installed XGBoost ignores `use_label_encoder` and warns about it (see the outputs below).
    "XGBoost": XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42)
}

In [None]:
# Using SMOTE
from imblearn.over_sampling import SMOTE

# Apply SMOTE to the training data only; the test sets are left untouched
sm = SMOTE(random_state=42)
x_train_smote, y_train_smote = sm.fit_resample(x_train, y_train)

results = {}
for model_name, model in models.items():
    print(f"Evaluating {model_name}...")
    results[model_name] = evaluate_model(model, x_train_smote, y_train_smote, x_legit_test, y_legit_test, x_fraud_test, y_fraud_test)
    print(f"{model_name} Legit Accuracy: {results[model_name]['accuracy_legit']:.5f}")
    print(f"{model_name} Fraud Accuracy: {results[model_name]['accuracy_fraud']:.5f}")

Evaluating Decision Tree...
Decision Tree Legit Accuracy: 0.94620
Decision Tree Fraud Accuracy: 0.91892
Evaluating SVM...


In [22]:
results

{'Decision Tree': {'accuracy_legit': 0.9461984876018524,
  'accuracy_fraud': 0.918918918918919},
 'SVM': {'accuracy_legit': 0.9843601617914297,
  'accuracy_fraud': 0.8851351351351351},
 'Random Forest': {'accuracy_legit': 0.9997537956503898,
  'accuracy_fraud': 0.8243243243243243},
 'Logistic Regression': {'accuracy_legit': 0.9741719913242277,
  'accuracy_fraud': 0.9324324324324325},
 'AdaBoost': {'accuracy_legit': 0.967758954217715,
  'accuracy_fraud': 0.918918918918919},
 'GaussianNB': {'accuracy_legit': 0.9760361099712762,
  'accuracy_fraud': 0.8648648648648649},
 'KNN': {'accuracy_legit': 0.9984055337358579,
  'accuracy_fraud': 0.8581081081081081},
 'XGBoost': {'accuracy_legit': 0.9993669031010024,
  'accuracy_fraud': 0.8581081081081081}}
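
Because each test set holds a single class, the two accuracies above are the true negative rate (legitimate set) and the recall (fraud set). As a rough sketch, they can be combined with the test-set sizes printed earlier (85,295 legitimate and 148 fraudulent transactions) to estimate precision, which the notebook does not report. The helper name `summarize_per_class` is hypothetical, and the counts are reconstructed from the accuracies, so the result should be read as approximate.

```python
def summarize_per_class(accuracy_legit, accuracy_fraud, n_legit, n_fraud):
    """Derive confusion-matrix style metrics from two single-class accuracies."""
    true_neg = accuracy_legit * n_legit   # legitimate predicted as legitimate
    true_pos = accuracy_fraud * n_fraud   # fraud predicted as fraud
    false_pos = n_legit - true_neg        # legitimate flagged as fraud
    false_neg = n_fraud - true_pos        # fraud that was missed
    flagged = true_pos + false_pos
    return {
        "recall": accuracy_fraud,
        "specificity": accuracy_legit,
        "precision": true_pos / flagged if flagged else float("nan"),
        "balanced_accuracy": (accuracy_legit + accuracy_fraud) / 2,
        "overall_accuracy": (true_pos + true_neg) / (n_legit + n_fraud),
        "false_positives": round(false_pos),
        "false_negatives": round(false_neg),
    }

# Decision Tree trained on SMOTE data, values copied from `results` above
summary = summarize_per_class(
    accuracy_legit=0.9461984876018524,
    accuracy_fraud=0.918918918918919,
    n_legit=85295,
    n_fraud=148,
)
for name, value in summary.items():
    print(f"{name}: {value}")
```

For this row the estimate comes to a precision of roughly 3%: a high fraud recall on such an imbalanced test set can still mean that most flagged transactions are legitimate.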

In [9]:
# Using ADASYN
from imblearn.over_sampling import ADASYN

# Apply ADASYN to the training data only
adasyn = ADASYN(random_state=42)
x_train_adasyn, y_train_adasyn = adasyn.fit_resample(x_train, y_train)

results_adasyn = {}
for model_name, model in models.items():
    print(f"Evaluating {model_name}...")
    results_adasyn[model_name] = evaluate_model(model, x_train_adasyn, y_train_adasyn, x_legit_test, y_legit_test, x_fraud_test, y_fraud_test)
    print(f"{model_name} Legit Accuracy: {results_adasyn[model_name]['accuracy_legit']:.5f}")
    print(f"{model_name} Fraud Accuracy: {results_adasyn[model_name]['accuracy_fraud']:.5f}")


Evaluating Decision Tree...
Decision Tree Legit Accuracy: 0.86322
Decision Tree Fraud Accuracy: 0.95270
Evaluating SVM...
SVM Legit Accuracy: 0.98554
SVM Fraud Accuracy: 0.75676
Evaluating Random Forest...
Random Forest Legit Accuracy: 0.99973
Random Forest Fraud Accuracy: 0.83108
Evaluating Logistic Regression...
Logistic Regression Legit Accuracy: 0.91022
Logistic Regression Fraud Accuracy: 0.94595
Evaluating AdaBoost...
AdaBoost Legit Accuracy: 0.91182
AdaBoost Fraud Accuracy: 0.92568
Evaluating GaussianNB...
GaussianNB Legit Accuracy: 0.96194
GaussianNB Fraud Accuracy: 0.88514
Evaluating KNN...
KNN Legit Accuracy: 0.99841
KNN Fraud Accuracy: 0.85811
Evaluating XGBoost...


Parameters: { "use_label_encoder" } are not used.



XGBoost Legit Accuracy: 0.99927
XGBoost Fraud Accuracy: 0.85135


In [10]:
results_adasyn

{'Decision Tree': {'accuracy_legit': 0.8632158977665748,
  'accuracy_fraud': 0.9527027027027027},
 'SVM': {'accuracy_legit': 0.9855442874728882,
  'accuracy_fraud': 0.7567567567567568},
 'Random Forest': {'accuracy_legit': 0.9997303476170936,
  'accuracy_fraud': 0.831081081081081},
 'Logistic Regression': {'accuracy_legit': 0.9102174805088223,
  'accuracy_fraud': 0.9459459459459459},
 'AdaBoost': {'accuracy_legit': 0.9118236707896126,
  'accuracy_fraud': 0.9256756756756757},
 'GaussianNB': {'accuracy_legit': 0.9619438419602556,
  'accuracy_fraud': 0.8851351351351351},
 'KNN': {'accuracy_legit': 0.9984055337358579,
  'accuracy_fraud': 0.8581081081081081},
 'XGBoost': {'accuracy_legit': 0.9992731109678176,
  'accuracy_fraud': 0.8513513513513513}}

In [11]:
# Using Random Oversampling (ROS)
from imblearn.over_sampling import RandomOverSampler

# Apply Random Oversampling
ros = RandomOverSampler(random_state=42)
x_train_ros, y_train_ros = ros.fit_resample(x_train, y_train)

# Evaluate models on the ROS-balanced dataset
results_ros = {}
for model_name, model in models.items():
    print(f"Evaluating {model_name}...")
    results_ros[model_name] = evaluate_model(model, x_train_ros, y_train_ros, x_legit_test, y_legit_test, x_fraud_test, y_fraud_test)
    print(f"{model_name} Legit Accuracy: {results_ros[model_name]['accuracy_legit']:.5f}")
    print(f"{model_name} Fraud Accuracy: {results_ros[model_name]['accuracy_fraud']:.5f}")


Evaluating Decision Tree...
Decision Tree Legit Accuracy: 0.96186
Decision Tree Fraud Accuracy: 0.88514
Evaluating SVM...
SVM Legit Accuracy: 0.98658
SVM Fraud Accuracy: 0.90541
Evaluating Random Forest...
Random Forest Legit Accuracy: 0.99988
Random Forest Fraud Accuracy: 0.82432
Evaluating Logistic Regression...
Logistic Regression Legit Accuracy: 0.97607
Logistic Regression Fraud Accuracy: 0.93243
Evaluating AdaBoost...
AdaBoost Legit Accuracy: 0.97692
AdaBoost Fraud Accuracy: 0.89189
Evaluating GaussianNB...
GaussianNB Legit Accuracy: 0.97404
GaussianNB Fraud Accuracy: 0.85811
Evaluating KNN...
KNN Legit Accuracy: 0.99938
KNN Fraud Accuracy: 0.82432
Evaluating XGBoost...


Parameters: { "use_label_encoder" } are not used.



XGBoost Legit Accuracy: 0.99982
XGBoost Fraud Accuracy: 0.83108
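
To compare the three runs side by side, the reported accuracies can be collected into one structure. The sketch below copies the values from the outputs above, rounded to five decimals, and ranks the samplers per model by the mean of the two accuracies (balanced accuracy). That summary metric is an assumption made for illustration; the notebook does not choose one.

```python
# (legit accuracy, fraud accuracy) per model, copied from the outputs above
reported = {
    "SMOTE": {
        "Decision Tree": (0.94620, 0.91892),
        "SVM": (0.98436, 0.88514),
        "Random Forest": (0.99975, 0.82432),
        "Logistic Regression": (0.97417, 0.93243),
        "AdaBoost": (0.96776, 0.91892),
        "GaussianNB": (0.97604, 0.86486),
        "KNN": (0.99841, 0.85811),
        "XGBoost": (0.99937, 0.85811),
    },
    "ADASYN": {
        "Decision Tree": (0.86322, 0.95270),
        "SVM": (0.98554, 0.75676),
        "Random Forest": (0.99973, 0.83108),
        "Logistic Regression": (0.91022, 0.94595),
        "AdaBoost": (0.91182, 0.92568),
        "GaussianNB": (0.96194, 0.88514),
        "KNN": (0.99841, 0.85811),
        "XGBoost": (0.99927, 0.85135),
    },
    "ROS": {
        "Decision Tree": (0.96186, 0.88514),
        "SVM": (0.98658, 0.90541),
        "Random Forest": (0.99988, 0.82432),
        "Logistic Regression": (0.97607, 0.93243),
        "AdaBoost": (0.97692, 0.89189),
        "GaussianNB": (0.97404, 0.85811),
        "KNN": (0.99938, 0.82432),
        "XGBoost": (0.99982, 0.83108),
    },
}

def balanced_accuracy(pair):
    """Mean of the legitimate and fraud accuracies."""
    legit_acc, fraud_acc = pair
    return (legit_acc + fraud_acc) / 2

model_names = list(reported["SMOTE"])
# Ties are resolved in favor of the sampler listed first in `reported`.
best_sampler = {
    name: max(reported, key=lambda sampler: balanced_accuracy(reported[sampler][name]))
    for name in model_names
}

for name in model_names:
    scores = ", ".join(
        f"{sampler}={balanced_accuracy(reported[sampler][name]):.4f}" for sampler in reported
    )
    print(f"{name:<20} {scores}  best: {best_sampler[name]}")
```

Under this metric no single sampler comes out ahead for every model, so the choice of oversampler is best treated as something to tune alongside the classifier.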
