
## Classification of Imbalanced Dataset

This notebook uses the Statlog (Shuttle) dataset, a widely used benchmark for machine learning classification tasks. The dataset is derived from space shuttle data and is used to predict the position of the shuttle's radiator, that is, whether it is in a state of radiative cooling or not. One of the primary challenges is the imbalanced class distribution: roughly 80% of the data points belong to class 1, while the other classes are underrepresented, some with only a handful of samples. This imbalance can bias a model towards predicting the majority class, which lowers classification performance on the minority classes even when overall accuracy stays high.
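To make the bias concrete, here is a minimal sketch on hypothetical toy labels (not the Shuttle data): a "model" that always predicts the majority class scores well on accuracy while missing the minority class entirely. The labels and the `recall` helper are illustrative assumptions.

```python
from collections import Counter

# Hypothetical toy labels: 8 samples of class 1, 2 samples of class 2.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 2, 2]

# A "model" that always predicts the majority class.
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

# Overall accuracy: fraction of correct predictions.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(label):
    """Fraction of the samples of `label` that were predicted as `label`."""
    hits = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    return hits / y_true.count(label)

# Macro recall: unweighted mean of the per-class recalls.
labels = sorted(set(y_true))
macro_recall = sum(recall(c) for c in labels) / len(labels)

print(f"accuracy     = {accuracy:.2f}")
print(f"macro recall = {macro_recall:.2f}")
```

Accuracy rewards the classifier for the 80% majority, whereas macro recall gives each class equal weight and exposes the missed minority class.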

In [1]:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from pandas import read_csv

In [2]:
# Load the training and test sets (provided as two separate CSV files)
df_train = read_csv("shuttle_train.csv")
df_test = read_csv("shuttle_test.csv")

In [3]:
# Print 5 random data samples in the training set
df_train.sample(5)

Unnamed: 0  Time  Rad Flow  Fpv Close  Fpv Open  High  Bypass  Bpv Close  Bpv Open  Class
     14800     4        86          0        52    10      35         34         0      1
     42439     3        88          0        54    23      35         33         0      1
     11953    -4        77          2        42     0      35         35         0      1
     14702    -1        83          0        50     1      34         33         0      1
      2903     4        97         -3        50    -3      40         48         8      4


In [4]:
# Print the class breakdown
groups = df_train.groupby("Class")
groups.size()

Class
1    34108
2       37
3      132
4     6748
5     2458
6        6
7       11
dtype: int64
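As a quick check of how skewed this breakdown is, the following sketch recomputes each class's share from the counts printed above. The `counts` dictionary is copied by hand from that output; the variable names are illustrative.

```python
# Class counts copied from the groupby output above.
counts = {1: 34108, 2: 37, 3: 132, 4: 6748, 5: 2458, 6: 6, 7: 11}

total = sum(counts.values())
shares = {c: n / total for c, n in counts.items()}

# Ratio between the largest and the smallest class.
imbalance_ratio = max(counts.values()) / min(counts.values())

for c, n in counts.items():
    print(f"class {c}: {n:>6} samples ({shares[c]:.3%})")
print(f"total: {total}, largest/smallest class ratio: {imbalance_ratio:.0f}")
```

Class 1 accounts for about 78% of the training set, while classes 2, 6 and 7 together make up well under 1%.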

In [5]:
# Separate the features (X) from the target column (y) in each set
X_train = df_train.drop(columns=["Class"])
y_train = df_train["Class"]
X_test = df_test.drop(columns=["Class"])
y_test = df_test["Class"]

In [6]:
# Scaling the features (fit on the training set only, then reuse the same scaling for the test set)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
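For reference, standardization can be sketched by hand on a single hypothetical feature. The values below are made up; the point is that the mean and standard deviation come from the training data only and are then reused for the test data, which is what `fit_transform` followed by `transform` does.

```python
from statistics import mean, pstdev

# Hypothetical values of one feature (not taken from the Shuttle data).
train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

# Statistics are estimated on the training set only.
mu = mean(train)
sigma = pstdev(train)  # population standard deviation, as StandardScaler uses

# The same mu and sigma are applied to both sets.
train_scaled = [(x - mu) / sigma for x in train]
test_scaled = [(x - mu) / sigma for x in test]

print(train_scaled)
print(test_scaled)
```

Reusing the training statistics keeps information about the test set out of the preprocessing step.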

In [7]:
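# Training a Random Forest classifier on the original (imbalanced) data as a baseline
# No random_state is set, so the scores can vary slightly between runs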
rfc1 = RandomForestClassifier(n_jobs=-1)
rfc1.fit(X_train_scaled, y_train)

# Predictions and Evaluation
predictions = rfc1.predict(X_test_scaled)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           1       1.00      1.00      1.00     11478
           2       1.00      0.92      0.96        13
           3       0.95      0.97      0.96        39
           4       1.00      1.00      1.00      2155
           5       1.00      1.00      1.00       809
           6       1.00      0.50      0.67         4
           7       1.00      0.50      0.67         2

    accuracy                           1.00     14500
   macro avg       0.99      0.84      0.89     14500
weighted avg       1.00      1.00      1.00     14500
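The gap between the `macro avg` and `weighted avg` rows can be reproduced from the per-class recalls in the report above. This sketch uses the rounded values as printed, so the results are approximate.

```python
# (recall, support) per class, copied from the report above.
report = {
    1: (1.00, 11478),
    2: (0.92, 13),
    3: (0.97, 39),
    4: (1.00, 2155),
    5: (1.00, 809),
    6: (0.50, 4),
    7: (0.50, 2),
}

total = sum(support for _, support in report.values())

# Macro average: every class counts equally.
macro = sum(recall for recall, _ in report.values()) / len(report)

# Weighted average: every class counts in proportion to its support.
weighted = sum(recall * support for recall, support in report.values()) / total

print(f"macro avg recall    = {macro:.2f}")
print(f"weighted avg recall = {weighted:.2f}")
```

The weighted average is dominated by class 1 and rounds to 1.00, while the macro average drops to about 0.84 because classes 6 and 7 have a recall of only 0.50.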



In [8]:
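# Applying SMOTE to the training set only
# (by default, every minority class is oversampled to the size of the majority class)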
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X_train_scaled, y_train)

# Training a Random Forest classifier
rfc2 = RandomForestClassifier(n_jobs=-1)
rfc2.fit(X_resampled, y_resampled)

# Predictions and Evaluation
predictions = rfc2.predict(X_test_scaled)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           1       1.00      1.00      1.00     11478
           2       1.00      0.92      0.96        13
           3       0.91      1.00      0.95        39
           4       1.00      1.00      1.00      2155
           5       1.00      1.00      1.00       809
           6       1.00      1.00      1.00         4
           7       1.00      1.00      1.00         2

    accuracy                           1.00     14500
   macro avg       0.99      0.99      0.99     14500
weighted avg       1.00      1.00      1.00     14500
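The core idea behind SMOTE is that a synthetic sample is placed at a random position on the line segment between a minority sample and one of its nearest minority-class neighbours. The sketch below shows only that interpolation step on hypothetical 2-D points; it omits the nearest-neighbour search and is not imblearn's implementation.

```python
import random

def smote_like_sample(x, neighbor, rng):
    """Return a point on the segment between x and one of its neighbors."""
    lam = rng.random()  # interpolation factor, uniform in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(0)  # seeded for repeatability

# Hypothetical minority-class sample and one of its neighbors.
x = [1.0, 2.0]
neighbor = [3.0, 6.0]

synthetic = [smote_like_sample(x, neighbor, rng) for _ in range(3)]
for point in synthetic:
    print(point)
```

Because new points are interpolated rather than duplicated, the oversampled classes gain some variety, which may be why classes 6 and 7 reach a recall of 1.00 in the report above.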



In [9]:
# Applying Random Under Sampler
# (by default, every class is reduced to the size of the smallest class)
rus = RandomUnderSampler()
X_resampled, y_resampled = rus.fit_resample(X_train_scaled, y_train)

# Training a Random Forest classifier
rfc3 = RandomForestClassifier(n_jobs=-1)
rfc3.fit(X_resampled, y_resampled)

# Predictions and Evaluation
predictions = rfc3.predict(X_test_scaled)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           1       0.99      0.57      0.72     11478
           2       0.02      0.77      0.03        13
           3       0.10      0.64      0.18        39
           4       0.27      0.71      0.39      2155
           5       1.00      1.00      1.00       809
           6       0.12      1.00      0.22         4
           7       0.01      1.00      0.01         2

    accuracy                           0.62     14500
   macro avg       0.36      0.81      0.36     14500
weighted avg       0.88      0.62      0.69     14500
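The sharp drop in performance is easier to understand after checking how much training data survives under-sampling. The sketch below assumes `RandomUnderSampler` is used with its default `sampling_strategy`, under which each class is cut down to the size of the smallest class; the counts are copied from the class breakdown of the training set.

```python
# Training-set class counts, copied from the class breakdown.
counts = {1: 34108, 2: 37, 3: 132, 4: 6748, 5: 2458, 6: 6, 7: 11}

# Assumed default behaviour: every class shrinks to the smallest class size.
target = min(counts.values())
resampled = {c: target for c in counts}

kept = sum(resampled.values())
fraction = kept / sum(counts.values())

print(f"samples per class after under-sampling: {target}")
print(f"training samples kept: {kept} of {sum(counts.values())} ({fraction:.3%})")
```

With only 6 samples per class, the forest is trained on 42 rows, which is too little to learn the boundaries between the large classes.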



In [10]:
# Adjusting class weights
rfc4 = RandomForestClassifier(class_weight='balanced', n_jobs=-1)
rfc4.fit(X_train_scaled, y_train)

# Predictions and Evaluation
predictions = rfc4.predict(X_test_scaled)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           1       1.00      1.00      1.00     11478
           2       1.00      0.92      0.96        13
           3       0.93      0.97      0.95        39
           4       1.00      1.00      1.00      2155
           5       1.00      1.00      1.00       809
           6       1.00      1.00      1.00         4
           7       1.00      1.00      1.00         2

    accuracy                           1.00     14500
   macro avg       0.99      0.99      0.99     14500
weighted avg       1.00      1.00      1.00     14500
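According to the scikit-learn documentation, `class_weight='balanced'` sets each class weight to `n_samples / (n_classes * count)`. The sketch below applies that formula to the training-set class counts shown earlier, to illustrate how strongly the rare classes are up-weighted.

```python
# Training-set class counts, copied from the class breakdown.
counts = {1: 34108, 2: 37, 3: 132, 4: 6748, 5: 2458, 6: 6, 7: 11}

n_samples = sum(counts.values())
n_classes = len(counts)

# "balanced" heuristic: weight is inversely proportional to class frequency.
weights = {c: n_samples / (n_classes * n) for c, n in counts.items()}

for c, w in weights.items():
    print(f"class {c}: weight = {w:.3f}")
```

A mistake on a class 6 sample is therefore penalized several thousand times more heavily than a mistake on a class 1 sample, without adding or removing any training data.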



In [11]:
# SVM with class weight adjustment
# (random_state only affects probability estimates, which are disabled by default)
svm = SVC(kernel='rbf', class_weight='balanced', random_state=42)
svm.fit(X_train_scaled, y_train)

# Predictions and Evaluation
predictions = svm.predict(X_test_scaled)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           1       1.00      0.99      0.99     11478
           2       0.81      1.00      0.90        13
           3       0.67      0.97      0.79        39
           4       0.96      1.00      0.98      2155
           5       0.98      1.00      0.99       809
           6       1.00      0.25      0.40         4
           7       0.67      1.00      0.80         2

    accuracy                           0.99     14500
   macro avg       0.87      0.89      0.84     14500
weighted avg       0.99      0.99      0.99     14500
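The `f1-score` column is the harmonic mean of precision and recall, which is why class 6 scores only 0.40 despite a precision of 1.00. The sketch below recomputes a few rows from the rounded values in the report above; `f1_score_from` is an illustrative helper, not a scikit-learn function.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) for selected classes, copied from the SVM report above.
svm_rows = {3: (0.67, 0.97), 6: (1.00, 0.25), 7: (0.67, 1.00)}

f1 = {c: f1_score_from(p, r) for c, (p, r) in svm_rows.items()}
for c, value in f1.items():
    print(f"class {c}: f1 = {value:.2f}")
```

The harmonic mean is pulled towards the lower of the two values, so a class must have both high precision and high recall to obtain a high F1 score.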

