## Understanding Imbalanced Data

The Core Issue: In an ideal world, your dataset for a classification problem would have a roughly equal distribution of examples for each class you're trying to predict. An imbalanced dataset means that one or more classes (the "majority" classes) have significantly more examples than others (the "minority" classes).

Why It's a Problem: Most classifiers are trained to reduce their overall error. When there's a huge imbalance, the easiest way to do that is to favor the majority class, which boosts the accuracy score. This leads the model to neglect the minority class, even though that class might be the one we care most about (e.g., detecting fraudulent transactions).
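
To see how far accuracy can mislead, here is a minimal sketch using made-up labels (998 normal, 2 fraud; these numbers are illustrative assumptions and not taken from any dataset). A "model" that always predicts the majority class reaches 99.8% accuracy while catching no fraud at all.

```python
# Toy labels: 0 = normal, 1 = fraud. The 998/2 split is an illustrative assumption.
y_true = [0] * 998 + [1] * 2

# A "lazy" classifier that always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for the fraud class: share of actual fraud cases that were caught.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(t == 1 for t in y_true)

print(f"Accuracy: {accuracy:.3f}")  # 0.998
print(f"Recall:   {recall:.3f}")    # 0.000
```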


## Hands-On Learning

Let's work with a dataset to make this concrete. We'll use a credit card fraud detection dataset. Download it from [Kaggle](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud?resource=download) and save it as `creditcard.csv` in your working directory.


In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix
)
from imblearn.over_sampling import RandomOverSampler

In [4]:
creditcard = pd.read_csv("creditcard.csv")
creditcard.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


### Modeling Without Addressing Imbalance (My Turn)

Let's do this part first to show the problem:

* We'll split the data into features (X) and target (y).
* We'll train a simple model (like Logistic Regression) on this data.
* We'll evaluate the model. Notice the deceptively high accuracy!

#### Let's check the value counts for each target label to get an idea of what problem we are dealing with

In [None]:
creditcard["Class"].value_counts()

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score 

# Split into features and target
X = creditcard.drop('Class', axis=1) 
y = creditcard['Class']

# Split into training and testing sets 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42) 

# Instantiate a Logistic Regression model
model = LogisticRegression(max_iter=10000)

# Fit the model without handling imbalance
model.fit(X_train, y_train)

# Predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('F1-Score:', f1_score(y_test, y_pred)) 

Accuracy: 0.9991994606893064
Precision: 0.8414634146341463
Recall: 0.6106194690265486
F1-Score: 0.7076923076923077
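
The split above is purely random, so the share of fraud in the test set can differ from the share in the full data. In scikit-learn, passing `stratify=y` to `train_test_split` keeps the class proportions the same in both splits. The idea can be sketched in plain Python; the helper name `stratified_split` and the toy labels are illustrative assumptions, not part of the notebook.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.25, seed=42):
    """Return (train_idx, test_idx) with each class split in the same ratio."""
    rng = random.Random(seed)

    # Group row indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle each class separately and send the same share of it to the test set.
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: 980 normal (0) and 20 fraud (1).
labels = [0] * 980 + [1] * 20
train_idx, test_idx = stratified_split(labels)

fraud_in_test = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), fraud_in_test)  # 750 250 5
```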


### Let's analyze these results in the context of our imbalanced fraud detection task:

* __Accuracy (0.999199)__: Seems fantastic, right? But this is highly misleading in scenarios with imbalanced classes. Fraud is such a small share of the data that a model labeling every transaction as normal would score almost as high.

* __Precision (0.841463)__: A respectable precision score tells us that approximately 84% of transactions the model flagged as fraudulent are actually fraudulent. Note this does depend on the prevalence of fraud in the test set.

* __Recall (0.610619)__:  This is where it gets worrisome. Recall tells us how many of the actual fraudulent transactions our model successfully detected. In this case, we're missing nearly 40% of the fraudulent cases!  This might lead to significant financial losses.

* __F1-Score (0.707692)__: The F1-score is the harmonic mean of precision and recall, so it is high only when both are high. Here the weak recall pulls it down. It is generally a more informative single number than accuracy in imbalanced situations.
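
All four scores can be derived from the confusion-matrix counts. The sketch below uses one set of counts (69 true positives, 13 false positives, 44 false negatives, 71,076 true negatives) that is consistent with the scores printed above. These counts are reconstructed from the scores and are an assumption, not output from the notebook. On the real data, `confusion_matrix(y_test, y_pred)`, already imported at the top, gives the actual counts.

```python
# Counts reconstructed from the reported scores (an assumption, not notebook output).
tp, fp, fn, tn = 69, 13, 44, 71076

accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are right
precision = tp / (tp + fp)                  # share of flagged transactions that are fraud
recall = tp / (tp + fn)                     # share of actual fraud that was flagged
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Accuracy:  {accuracy:.4f}")   # 0.9992
print(f"Precision: {precision:.4f}")  # 0.8415
print(f"Recall:    {recall:.4f}")     # 0.6106
print(f"F1-Score:  {f1:.4f}")         # 0.7077
```

Note how the 71,076 true negatives dominate the accuracy formula, which is why accuracy stays near 1 no matter how the fraud cases are handled.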

__Key Takeaway__

* The issue with imbalanced data is clear. It's easy for a model to get "lazy" and prioritize the majority class, often sacrificing its ability to accurately detect the minority class, even when that minority class (fraudulent transactions) is the one we actually care about the most.
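* Accuracy alone is not a good metric for problems where missing the minority class is costly. Precision on its own is not enough either, because it says nothing about the fraud cases the model never flags. Always check recall as well.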

## __What Now?__

This analysis emphasizes the need to address the class imbalance directly. Let's move on to try the following techniques:

Oversampling:

* Use RandomOverSampler from the imblearn library to add more fraudulent transactions to the training data by randomly duplicating existing ones.

* Employ SMOTE (Synthetic Minority Oversampling Technique) for a more sophisticated approach that generates synthetic minority examples.

Undersampling:

* Use RandomUnderSampler to reduce the number of normal transactions.
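
The ideas behind these three techniques can be sketched in plain Python. This is an illustration under simplifying assumptions, not how imblearn implements them: the function names and toy rows are hypothetical, and the SMOTE-style step interpolates between two given minority points, whereas SMOTE itself chooses the second point among the nearest minority neighbors.

```python
import random

def random_oversample(majority, minority, seed=42):
    """Duplicate randomly chosen minority rows until both classes are the same size."""
    rng = random.Random(seed)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority, minority + extra

def random_undersample(majority, minority, seed=42):
    """Keep a random subset of majority rows the same size as the minority class."""
    rng = random.Random(seed)
    return rng.sample(majority, k=len(minority)), minority

def interpolate(point_a, point_b, seed=42):
    """SMOTE-style step: a new point on the line segment between two minority points."""
    rng = random.Random(seed)
    t = rng.random()  # position along the segment, in [0, 1)
    return [a + t * (b - a) for a, b in zip(point_a, point_b)]

# Toy data: each row is a list of two feature values (illustrative only).
majority = [[float(i), 0.0] for i in range(10)]
minority = [[1.0, 5.0], [2.0, 6.0]]

_, oversampled_minority = random_oversample(majority, minority)
undersampled_majority, _ = random_undersample(majority, minority)
synthetic = interpolate(minority[0], minority[1])

print(len(oversampled_minority))   # 10, all copies of the two original rows
print(len(undersampled_majority))  # 2
print(synthetic)                   # a point between [1.0, 5.0] and [2.0, 6.0]
```

Random oversampling adds no new information, only repeated rows, while undersampling throws information away. Which cost is acceptable depends on how much data is available.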

In [20]:
# Split into features and target
X = creditcard.drop("Class", axis=1)
y = creditcard["Class"]

# Define the oversampler
oversampler = RandomOverSampler(random_state=42)

# Split into training and testing sets first, then oversample only the training data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
X_train, y_train = oversampler.fit_resample(X_train, y_train)


# Instantiate a Logistic Regression model
model = LogisticRegression(max_iter=10000)

# Fit the model on the oversampled training data
model.fit(X_train, y_train)

# Predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1-Score:", f1_score(y_test, y_pred))

Accuracy: 0.9769809836802337
Precision: 0.06286043829296424
Recall: 0.8861788617886179
F1-Score: 0.11739364566505116
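
Note that the cell above splits first and resamples only the training set. Oversampling before the split would let copies of the same fraudulent row land in both the training and the test set, which inflates test scores. A minimal sketch with made-up row IDs (the helper names and dataset sizes are illustrative assumptions):

```python
import random

rng = random.Random(42)

# Toy dataset of (row_id, label) pairs: 95 normal rows and 5 fraud rows.
rows = [(i, 0) for i in range(95)] + [(i, 1) for i in range(95, 100)]

def oversample(data):
    """Duplicate random minority rows until both classes have the same count."""
    majority = [r for r in data if r[1] == 0]
    minority = [r for r in data if r[1] == 1]
    if not minority:
        return data
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return data + extra

def split(data, test_size=0.25):
    """Shuffle a copy of the data and use the last test_size share as the test set."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_size))
    return shuffled[:cut], shuffled[cut:]

def shared_ids(train, test):
    """Row IDs that appear in both splits."""
    return {r[0] for r in train} & {r[0] for r in test}

# Wrong order: oversample, then split. Copies of a fraud row end up on both sides.
train_leaky, test_leaky = split(oversample(rows))
print("Shared rows, oversample first:", len(shared_ids(train_leaky, test_leaky)))

# Right order: split, then oversample the training part only.
train, test = split(rows)
train = oversample(train)
print("Shared rows, split first:", len(shared_ids(train, test)))  # 0
```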


### __Let's analyze the changes caused by oversampling:__

__Changes__
* __Recall__: A substantial increase! Our model catches roughly 89% of actual fraudulent transactions compared to the previous 61%. This is exactly what we wanted. (The two runs use slightly different test splits, because only this one passes `stratify=y`, so the comparison is approximate.)
* __Precision__: A sharp decrease, from about 84% to about 6%. Roughly 94% of the transactions now flagged as fraudulent are false alarms.
* __Accuracy:__ A decrease from about 99.9% to 97.7%. The test set is still dominated by normal transactions, so this drop mostly reflects how many of them are now falsely flagged as fraud.
* __F1-score__: This fell from about 0.71 to 0.12, because the gain in recall is outweighed by the collapse in precision.

### __The Trade-Off__

This illustrates a common theme with imbalanced datasets: it's a balancing act. By oversampling, we've:

* __Improved Sensitivity:__ The model identifies fraud better, important for our business case.
* __Lost Specificity:__ More normal transactions are falsely flagged, possibly a nuisance.


## __What Does This Mean?__

There's no universally "correct" answer. What's better depends on our priorities:

* __False Negatives are Very Costly:__ In a high-risk fraud scenario, we might accept having more false positives if it prevents large financial losses due to missed fraud.

* __False Positives Create Customer Friction:__ If unnecessarily blocking customer transactions is bad for the business model, we might favor more precision-focused solutions.
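
One way to make this choice concrete is to attach a cost to each kind of error and compare totals. The sketch below uses error counts reconstructed from the two sets of scores printed earlier (44 missed frauds and 13 false alarms before oversampling, 14 and 1,625 after). Both the counts and the per-error costs are assumptions for illustration, and the two runs used slightly different test splits.

```python
def total_cost(false_negatives, false_positives, cost_fn, cost_fp):
    """Total cost of a model's errors under fixed per-error costs."""
    return false_negatives * cost_fn + false_positives * cost_fp

# Error counts reconstructed from the reported scores (assumptions, not notebook output).
models = {
    "baseline": {"false_negatives": 44, "false_positives": 13},
    "oversampled": {"false_negatives": 14, "false_positives": 1625},
}

# Two hypothetical cost scenarios, in arbitrary currency units.
scenarios = {
    "missed fraud is very costly": {"cost_fn": 500, "cost_fp": 5},
    "missed fraud is moderately costly": {"cost_fn": 100, "cost_fp": 5},
}

for name, costs in scenarios.items():
    totals = {m: total_cost(**errs, **costs) for m, errs in models.items()}
    cheapest = min(totals, key=totals.get)
    print(f"{name}: {totals} -> cheapest: {cheapest}")
```

Under the first scenario the oversampled model is cheaper (15,125 versus 22,065); under the second the baseline wins (4,465 versus 9,525). The same two models rank differently once the business costs change.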