<a href="https://colab.research.google.com/github/wooihaw/imbalanced_classification/blob/main/imbalanced_dataset_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hands-on 2
## Classification of Imbalanced Dataset

The Statlog (Shuttle) dataset is widely used in machine learning for classification tasks. It is composed of data derived from the space shuttle and is used to predict the radiator position on the shuttle, that is, whether it is in a state of radiative cooling or not. One of the primary challenges is the imbalanced distribution of classes: roughly 80% of the samples belong to class 1, while the other classes are underrepresented. This imbalance can bias a model towards predicting the majority class, which reduces classification performance on the minority classes even when overall accuracy looks high.
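To make the majority-class bias concrete, the sketch below scores a hypothetical model that always predicts the most frequent class. It uses the test-set class counts from the "support" column of the classification reports later in this notebook; the function name `majority_baseline` is an illustrative assumption, not part of any library.

```python
# Test-set class counts (the "support" column of the classification reports
# shown later in this notebook).
support = {1: 11478, 2: 13, 3: 39, 4: 2155, 5: 809, 6: 4, 7: 2}

def majority_baseline(support):
    """Score a hypothetical model that always predicts the most frequent class."""
    majority = max(support, key=support.get)
    total = sum(support.values())
    accuracy = support[majority] / total
    # Recall is 1.0 for the majority class and 0.0 for every other class,
    # so the unweighted (macro) mean recall is 1 / number_of_classes.
    macro_recall = 1.0 / len(support)
    return majority, accuracy, macro_recall

majority, accuracy, macro_recall = majority_baseline(support)
print(f"always predict class {majority}: "
      f"accuracy={accuracy:.3f}, macro recall={macro_recall:.3f}")
```

Such a model reaches about 79% accuracy while recognising none of the six minority classes, which is why per-class recall and the macro average are worth watching in the reports that follow.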

In [14]:
# Initialization
%matplotlib inline
from warnings import filterwarnings
filterwarnings("ignore")

In [15]:
from imblearn.over_sampling import SMOTE, RandomOverSampler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from pandas import read_csv

In [16]:
# Load the dataset
df_train = read_csv("data/shuttle_train.csv")
df_test = read_csv("data/shuttle_test.csv")

To do: 
- Print the first 5 data samples in the training set

In [17]:
df_train.head()


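| Unnamed: 0 | Time | Rad Flow | Fpv Close | Fpv Open | High | Bypass | Bpv Close | Bpv Open | Class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 21 | 77 | 0 | 28 | 0 | 27 | 48 | 22 | 2 |
| 1 | 0 | 92 | 0 | 0 | 26 | 36 | 92 | 56 | 4 |
| 2 | 0 | 82 | 0 | 52 | -5 | 29 | 30 | 2 | 1 |
| 3 | 0 | 76 | 0 | 28 | 18 | 40 | 48 | 8 | 1 |
| 4 | 0 | 79 | 0 | 34 | -26 | 43 | 46 | 2 | 1 |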


To do: 
- Print the class breakdown of the training set

In [18]:
df_train.groupby("Class").size()


Class
1    34108
2       37
3      132
4     6748
5     2458
6        6
7       11
dtype: int64
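The raw counts are easier to judge as proportions. The following sketch, which simply copies the counts printed above into a dictionary, computes each class's share of the training set and the ratio between the largest and smallest class.

```python
# Training-set class counts copied from the output above.
class_counts = {1: 34108, 2: 37, 3: 132, 4: 6748, 5: 2458, 6: 6, 7: 11}

total = sum(class_counts.values())
shares = {label: count / total for label, count in class_counts.items()}
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

for label, share in shares.items():
    print(f"class {label}: {class_counts[label]:>6} samples ({share:.3%})")
print(f"largest / smallest class: {imbalance_ratio:.0f} to 1")
```

On the real DataFrame, `df_train["Class"].value_counts(normalize=True)` should give the same shares directly.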

To do: 
- Split the dataset into features (X_train, X_test) and labels (y_train, y_test)

In [19]:
X_train = df_train.drop(columns=["Class"])
y_train = df_train["Class"]
X_test = df_test.drop(columns=["Class"])
y_test = df_test["Class"]

To do:
- Train a Random Forest classifier
- Make predictions based on the test features (X_test)
- Print the classification report

In [20]:
rfc1 = RandomForestClassifier().fit(X_train, y_train)
y_pred = rfc1.predict(X_test)
report = classification_report(y_test, y_pred)
print(report)


              precision    recall  f1-score   support

           1       1.00      1.00      1.00     11478
           2       1.00      0.92      0.96        13
           3       0.95      1.00      0.97        39
           4       1.00      1.00      1.00      2155
           5       1.00      1.00      1.00       809
           6       1.00      0.75      0.86         4
           7       1.00      0.50      0.67         2

    accuracy                           1.00     14500
   macro avg       0.99      0.88      0.92     14500
weighted avg       1.00      1.00      1.00     14500
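The gap between the macro and weighted averages in this report comes from how each one treats class size. As a rough check, the sketch below recomputes both recall averages from the per-class values printed above. Those inputs are already rounded to two decimals, so the results are approximate.

```python
# Per-class recall and support copied from the report above.
recall = {1: 1.00, 2: 0.92, 3: 1.00, 4: 1.00, 5: 1.00, 6: 0.75, 7: 0.50}
support = {1: 11478, 2: 13, 3: 39, 4: 2155, 5: 809, 6: 4, 7: 2}

# Macro average: every class counts equally, however few samples it has.
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: every class counts in proportion to its support.
total = sum(support.values())
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

print(f"macro recall:    {macro_recall:.2f}")
print(f"weighted recall: {weighted_recall:.2f}")
```

The weighted average is dominated by class 1 and stays near 1.00, while the macro average drops because classes 6 and 7 weigh as much as class 1 despite having only 4 and 2 test samples.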



To do: 
- Train a Random Forest classifier on the data resampled using SMOTE
- Make predictions based on the test features (X_test)
- Print the classification report 

In [22]:
# Applying SMOTE
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

rfc2 = RandomForestClassifier().fit(X_resampled, y_resampled)
y_pred = rfc2.predict(X_test)
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

           1       1.00      1.00      1.00     11478
           2       1.00      0.92      0.96        13
           3       0.93      1.00      0.96        39
           4       1.00      1.00      1.00      2155
           5       1.00      1.00      1.00       809
           6       1.00      1.00      1.00         4
           7       1.00      1.00      1.00         2

    accuracy                           1.00     14500
   macro avg       0.99      0.99      0.99     14500
weighted avg       1.00      1.00      1.00     14500
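For intuition about what SMOTE does to the rare classes, here is a simplified pure-Python sketch of its core idea: each synthetic sample is placed at a random position on the line segment between a minority sample and one of its nearest minority-class neighbours. The function `smote_sketch` and the toy points are illustrative assumptions. This is not imblearn's implementation, which handles sampling strategy, multiple classes, and neighbour search more carefully.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Simplified illustration only; use imblearn's SMOTE for real work.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbours than other points
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # The new point lies on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical 2-D minority class with 6 samples (toy values, not shuttle data).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.7), (1.3, 1.0)]
new_points = smote_sketch(minority, n_new=4)
print(new_points)
```

Because every synthetic point is an interpolation, it stays inside the region already covered by the minority samples. With only 6 and 11 training samples for classes 6 and 7, that region is small.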



To do: 
- Train a Random Forest classifier on the data resampled using RandomOverSampler
- Make predictions based on the test features (X_test)
- Print the classification report 

In [23]:
# Applying RandomOverSampler
ros = RandomOverSampler()
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)

rfc3 = RandomForestClassifier().fit(X_resampled, y_resampled)
y_pred = rfc3.predict(X_test)
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

           1       1.00      1.00      1.00     11478
           2       1.00      0.92      0.96        13
           3       0.95      1.00      0.97        39
           4       1.00      1.00      1.00      2155
           5       1.00      1.00      1.00       809
           6       1.00      1.00      1.00         4
           7       1.00      1.00      1.00         2

    accuracy                           1.00     14500
   macro avg       0.99      0.99      0.99     14500
weighted avg       1.00      1.00      1.00     14500
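Unlike SMOTE, random oversampling creates no new feature values. It repeats existing minority rows until the classes are the same size. The sketch below illustrates that idea in plain Python on hypothetical toy data. The function `random_oversample` is an illustrative assumption and not imblearn's code.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        # Sample with replacement, so existing rows are repeated verbatim.
        extra = rng.choices(rows, k=target - count)
        X_out.extend(extra)
        y_out.extend([label] * len(extra))
    return X_out, y_out

# Hypothetical toy data: class 1 dominates, classes 6 and 7 are rare.
X = [[0], [1], [2], [3], [4], [5], [6], [7]]
y = [1, 1, 1, 1, 1, 6, 6, 7]
X_res, y_res = random_oversample(X, y)
print(Counter(y_res))
```

After resampling, every class has as many rows as the original majority class, and every added row is an exact copy of a row that was already present.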



To do: 
- Train a Random Forest classifier by setting the class_weight to "balanced"
- Make predictions based on the test features (X_test)
- Print the classification report 

In [24]:
rfc4 = RandomForestClassifier(class_weight="balanced").fit(X_train, y_train)
y_pred = rfc4.predict(X_test)
report = classification_report(y_test, y_pred)
print(report)


              precision    recall  f1-score   support

           1       1.00      1.00      1.00     11478
           2       1.00      0.92      0.96        13
           3       0.93      0.97      0.95        39
           4       1.00      1.00      1.00      2155
           5       1.00      1.00      1.00       809
           6       1.00      0.75      0.86         4
           7       1.00      1.00      1.00         2

    accuracy                           1.00     14500
   macro avg       0.99      0.95      0.97     14500
weighted avg       1.00      1.00      1.00     14500
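With `class_weight="balanced"`, scikit-learn weights each class by `n_samples / (n_classes * class_count)` instead of resampling the data. The sketch below applies that formula to the training-set class counts printed earlier in this notebook, to show how strongly the rare classes are up-weighted.

```python
# Training-set class counts from the class breakdown earlier in this notebook.
class_counts = {1: 34108, 2: 37, 3: 132, 4: 6748, 5: 2458, 6: 6, 7: 11}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# Weight per class: n_samples / (n_classes * class_count).
balanced_weights = {
    label: n_samples / (n_classes * count) for label, count in class_counts.items()
}

for label, weight in balanced_weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

Class 1 ends up with a weight of about 0.18 and class 6 with a weight of about 1036, so a misclassified class 6 sample counts several thousand times more than a misclassified class 1 sample during training.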

