# **Semi-Supervised Learning Demo**

## Load & Preprocess Iris Dataset

*   Let's consider a **Classification problem with the Iris dataset**, which has 150 samples and 4 features.
*   We simulate a Semi-supervised learning situation by creating a **Small (20%) Labeled set (X_labeled, y_labeled)** and a **Large (80%) Unlabeled set (X_unlabeled)**.

*   Labels in the **y_unlabeled are set to -1** to represent unlabeled data.
*   You can implement more **advanced pseudo-labeling** by training a model on the small **Labeled dataset**, predicting labels for the Unlabeled dataset, and iteratively retraining the model with those Predictions as **pseudo-labels on the unlabeled data**.

### Import Necessary Python Libraries

In [1]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import SelfTrainingClassifier

import pandas as pd

print('Libraries Imported Successfully!')

Libraries Imported Successfully!


### Load Iris Dataset & Observe Features (X) & Labels/Targets (y)

In [2]:
# Load dataset (Iris dataset for demonstration)
data = load_iris()
X = data.data
y = data.target

print("Shape of X:", X.shape)
print("Shape of y:", y.shape)

Shape of X: (150, 4)
Shape of y: (150,)


In [3]:
print('feature_names === ', data.feature_names)
print('target_names === ', data.target_names)
data.target

feature_names ===  ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
target_names ===  ['setosa' 'versicolor' 'virginica']


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [4]:
print(y[5:12])
print(y[55:62])
print(y[105:112])

[0 0 0 0 0 0 0]
[1 1 1 1 1 1 1]
[2 2 2 2 2 2 2]


In [5]:
# Convert the NumPy array to a Pandas Series
y_series = pd.Series(y)

counts = y_series.value_counts()
print(counts)

0    50
1    50
2    50
Name: count, dtype: int64


In [6]:
X[5:8]

array([[5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2]])

### Consider Train Dataset as **Labeled** & Test Dataset as **Unlabeled**

In [7]:
# Let's pretend we have a small labeled dataset and a larger unlabeled dataset
# Split into labeled and unlabeled data
X_labeled, X_unlabeled, y_labeled, y_unlabeled = train_test_split(X, y, test_size=0.8, random_state=42)

# Mask all labels in the unlabeled split (simulating real-world semi-supervised data)
# Note: this overwrites the true labels of these 120 samples
y_unlabeled = np.full_like(y_unlabeled, -1)  # Set all the labels to -1 (unlabeled data)

print('Shape of X_labeled: ', X_labeled.shape)
print('Length of y_labeled: ', len(y_labeled))

print('Shape of X_unlabeled: ', X_unlabeled.shape)
print('Length of y_unlabeled: ', len(y_unlabeled))

Shape of X_labeled:  (30, 4)
Length of y_labeled:  30
Shape of X_unlabeled:  (120, 4)
Length of y_unlabeled:  120
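
The split above overwrites the true labels of the unlabeled samples. A minimal sketch of one alternative, assuming you want to keep those labels for later evaluation: the names `y_unlabeled_true` and the `stratify=y` option are additions for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y (an assumption, not used in the notebook) keeps the three
# classes balanced inside the small labeled set
X_labeled, X_unlabeled, y_labeled, y_unlabeled_true = train_test_split(
    X, y, test_size=0.8, random_state=42, stratify=y
)

# Masked copy used for training; the true labels stay untouched for evaluation
y_unlabeled = np.full_like(y_unlabeled_true, -1)

print(np.bincount(y_labeled))   # class counts in the labeled set
print(np.unique(y_unlabeled))   # the masked copy contains only -1
```

Keeping `y_unlabeled_true` separate means the model never sees these labels during training, but they remain available to measure how well pseudo-labeling worked.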


## Semi-supervised Learning Demo - 1: Using the **Semi-Supervised** SelfTrainingClassifier

### **Combined** Dataset

In [8]:
# Combine labeled and unlabeled data
X_combined = np.concatenate([X_labeled, X_unlabeled], axis=0)
y_combined = np.concatenate([y_labeled, y_unlabeled], axis=0)

print('Shape of X_combined: ', X_combined.shape)
print('Shape of y_combined: ', y_combined.shape)

Shape of X_combined:  (150, 4)
Shape of y_combined:  (150,)


In [9]:
X_combined[50:58]

array([[4.7, 3.2, 1.6, 0.2],
       [6.1, 3. , 4.9, 1.8],
       [5. , 3.4, 1.6, 0.4],
       [6.4, 2.8, 5.6, 2.1],
       [7.9, 3.8, 6.4, 2. ],
       [6.7, 3. , 5.2, 2.3],
       [6.7, 2.5, 5.8, 1.8],
       [6.8, 3.2, 5.9, 2.3]])

In [10]:
y_combined[50:58]

array([-1, -1, -1, -1, -1, -1, -1, -1])

### Preprocessing - Standardized Dataset to address any differences in Units of Features

In [11]:
# Standardize the data
scaler = StandardScaler()
X_combined_scaled = scaler.fit_transform(X_combined)
X_combined_scaled[50:58]

array([[-1.38535265,  0.32841405, -1.22655167, -1.3154443 ],
       [ 0.31099753, -0.13197948,  0.64908342,  0.79067065],
       [-1.02184904,  0.78880759, -1.22655167, -1.05217993],
       [ 0.67450115, -0.59237301,  1.0469454 ,  1.18556721],
       [ 2.4920192 ,  1.70959465,  1.50164482,  1.05393502],
       [ 1.03800476, -0.13197948,  0.8195957 ,  1.44883158],
       [ 1.03800476, -1.28296331,  1.16062026,  0.79067065],
       [ 1.15917263,  0.32841405,  1.21745768,  1.44883158]])

In [12]:
# Initialize a classifier (MLP Classifier in this case)
mlp_demo1 = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=42)
mlp_demo1

### We use SelfTrainingClassifier from sklearn.semi_supervised, which wraps a base model (in this case, an MLP classifier) and iteratively assigns **pseudo-labels** to the unlabeled samples it predicts with high confidence, then refits on the enlarged labeled set.

In [13]:
# Wrap the classifier with a self-training model
self_training_model = SelfTrainingClassifier(mlp_demo1)

### Model is trained on the Combined (Labeled & Unlabeled) data: it initially uses only the Labeled data, then generates **pseudo-labels** for the Unlabeled data and retrains on them.

In [14]:
# Train the semi-supervised classifier
self_training_model.fit(X_combined_scaled, y_combined)
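
To see what self-training actually did during `fit`, the fitted wrapper exposes a few attributes. The sketch below rebuilds the same setup in a self-contained way and spells out the `threshold` and `max_iter` values explicitly (0.75 and 10, which are scikit-learn's defaults); the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = load_iris(return_X_y=True)
X_lab, X_unlab, y_lab, _ = train_test_split(X, y, test_size=0.8, random_state=42)

# Labeled rows first, then unlabeled rows marked with -1
X_all = StandardScaler().fit_transform(np.concatenate([X_lab, X_unlab]))
y_all = np.concatenate([y_lab, np.full(len(X_unlab), -1)])

base = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=42)
model = SelfTrainingClassifier(base, threshold=0.75, max_iter=10)
model.fit(X_all, y_all)

# labeled_iter_: 0 = originally labeled, k > 0 = pseudo-labeled in round k, -1 = never labeled
print("Stopped because:", model.termination_condition_)
print("Self-training rounds:", model.n_iter_)
print("Pseudo-labeled samples:", int(np.sum(model.labeled_iter_ > 0)))
print("Still unlabeled:", int(np.sum(model.transduction_ == -1)))
```

Raising `threshold` makes the wrapper more conservative: fewer samples receive pseudo-labels per round, at the cost of leaving more of the pool unused.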

In [15]:
# Use the trained model to make predictions
y_pred = self_training_model.predict(X_combined_scaled)
y_pred[50:58]

array([0, 2, 0, 2, 2, 2, 2, 2])

### Model 1 Evaluation / Accuracy Metric

In [16]:
# Evaluate Model Accuracy
# Predict labels for the entire dataset and compare them with y_combined using accuracy_score.
# Caveat: y_combined holds -1 for the 120 unlabeled samples and predict() never returns -1,
# so this score cannot exceed 30/150 = 20%; it is not the model's true accuracy.
accuracy = round((accuracy_score(y_combined, y_pred))*100,2)
print(f"Accuracy against y_combined (known labels + -1 placeholders): {accuracy:.2f}%")

Accuracy against y_combined (known labels + -1 placeholders): 20.00%
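
The 20% figure is an artifact of scoring against the `-1` placeholders rather than a sign of a poor model. A small NumPy-only sketch, using made-up labels with the same 30/120 layout as `y_combined`, shows that even a perfect predictor is capped at 20% under this comparison:

```python
import numpy as np

# Hypothetical stand-in for the notebook's data: 150 "true" labels
y_true = np.tile([0, 1, 2], 50)

# Same layout as y_combined: 30 known labels followed by 120 masked ones
y_masked = y_true.copy()
y_masked[30:] = -1

# A perfect prediction still cannot match the -1 placeholders
perfect_pred = y_true
acc_vs_masked = np.mean(perfect_pred == y_masked)  # fraction of matches, as accuracy_score computes
acc_vs_true = np.mean(perfect_pred == y_true)

print(acc_vs_masked)  # 0.2
print(acc_vs_true)    # 1.0
```

To measure real performance, the true labels of the 120 samples would need to be kept before masking and compared against the predictions for those samples only.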


## Semi-supervised Learning Demo - 2: Using Only **Labeled** Data, obtain Labels for **Unlabeled** Data; **Retrain Model** Using **High-Confidence** Samples

### Preprocessing - Standardized Dataset to address any differences in Units of Features

In [17]:
# Standardize the data
scaler = StandardScaler()
X_labeled_scaled_2 = scaler.fit_transform(X_labeled)
X_unlabeled_scaled_2 = scaler.transform(X_unlabeled)

print(X_labeled_scaled_2[10:15])
print('*'*50)
print(X_unlabeled_scaled_2[50:55])

[[ 0.87426274  0.77443551  0.96876119  1.72973101]
 [ 1.50172403  0.02498179  1.02981757  0.42518528]
 [-1.38459792  1.52388924 -1.65666304 -1.74905761]
 [ 0.87426274  0.2747997   0.90770481  1.58478149]
 [-1.38459792  0.02498179 -1.65666304 -1.60410809]]
**************************************************
[[-0.38065985 -0.22483612  0.05291553 -0.0096633 ]
 [ 2.00369307  0.02498179  1.51826859  1.14993291]
 [-0.50615211  0.02498179  0.23608466  0.28023575]
 [-1.13361341  1.27407133 -1.65666304 -1.60410809]
 [ 2.12918533 -0.47465402  1.57932497  1.00498338]]
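
Here the scaler is fitted on the labeled rows only and then reused for the unlabeled rows. The following sketch, which is an illustration and not part of the original notebook, makes the consequence visible: the labeled features end up with zero mean and unit variance, while the unlabeled features are only approximately centered.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_labeled, X_unlabeled, _, _ = train_test_split(X, y, test_size=0.8, random_state=42)

# Mean and standard deviation are estimated from the 30 labeled rows only
scaler = StandardScaler().fit(X_labeled)
X_labeled_scaled = scaler.transform(X_labeled)
X_unlabeled_scaled = scaler.transform(X_unlabeled)

print(scaler.mean_)                                # per-feature means of the labeled rows
print(X_labeled_scaled.mean(axis=0).round(6))      # approximately 0 by construction
print(X_unlabeled_scaled.mean(axis=0).round(2))    # generally not exactly 0
```

This differs from Demo 1, where the scaler was fitted on the combined data; both are workable, but the statistics come from different samples.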


### Initialize a Classifier (MLP Classifier in this case)

In [18]:
# Initialize a classifier (MLP Classifier in this case)
mlp_demo2 = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=42)
mlp_demo2

### Model is trained on **Labeled** Data Only

In [19]:
# Train the model on the labeled dataset only
mlp_demo2.fit(X_labeled_scaled_2, y_labeled)

### Initial Prediction of Target/Label on **Unlabeled** Data

In [20]:
# Initial predictions on the unlabeled data
y_unlabeled_pred_2 = mlp_demo2.predict(X_unlabeled_scaled_2)
y_unlabeled_pred_2[50:58]

array([1, 2, 1, 0, 2, 1, 0, 0])

### Initial **Probability/Confidence** of Target/Label on **Unlabeled** Data

In [21]:
y_unlabeled_proba = mlp_demo2.predict_proba(X_unlabeled_scaled_2)
y_unlabeled_proba[50:58]

array([[7.49813173e-03, 9.75412319e-01, 1.70895497e-02],
       [6.27226610e-07, 1.35084117e-04, 9.99864289e-01],
       [1.52696249e-02, 7.26535964e-01, 2.58194411e-01],
       [9.99678861e-01, 3.20977923e-04, 1.61324602e-07],
       [2.70666137e-07, 1.79811709e-04, 9.99819918e-01],
       [2.63386568e-03, 9.96352992e-01, 1.01314237e-03],
       [9.99283820e-01, 7.15960076e-04, 2.20006388e-07],
       [9.99693170e-01, 3.06601675e-04, 2.27859383e-07]])

### Obtain Samples with **Probability/Confidence** Higher than the set Threshold

In [22]:
# Define a threshold for high-confidence predictions
confidence_threshold = 0.9
high_confidence_mask = np.max(y_unlabeled_proba, axis=1) >= confidence_threshold
# Count the True entries; len() would return the total number of unlabeled samples (120)
print('No. of Unlabeled Samples with Confidence >= Threshold: ', high_confidence_mask.sum())
high_confidence_mask

No. of Unlabeled Samples with Confidence >= Threshold:  104


array([ True,  True,  True, False,  True,  True,  True,  True, False,
        True,  True,  True,  True,  True,  True, False,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True, False,  True,  True,  True,  True,  True, False,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
       False,  True,  True, False,  True,  True,  True,  True,  True,
        True,  True,  True, False,  True,  True,  True,  True, False,
        True,  True, False, False,  True,  True,  True,  True, False,
        True,  True,  True,  True, False,  True,  True,  True,  True,
        True,  True,  True, False,  True,  True,  True,  True,  True,
        True,  True, False])
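
The mask logic can be checked by hand on a few rows. The sketch below uses the eight probability rows printed earlier (rows 50 to 57), rounded to four decimals for readability; the variable names are illustrative.

```python
import numpy as np

# Rounded from the predict_proba output shown earlier (rows 50-57)
proba = np.array([
    [0.0075, 0.9754, 0.0171],
    [0.0000, 0.0001, 0.9999],
    [0.0153, 0.7265, 0.2582],
    [0.9997, 0.0003, 0.0000],
    [0.0000, 0.0002, 0.9998],
    [0.0026, 0.9964, 0.0010],
    [0.9993, 0.0007, 0.0000],
    [0.9997, 0.0003, 0.0000],
])

confidence = proba.max(axis=1)        # highest class probability per row
pseudo_labels = proba.argmax(axis=1)  # class index with that probability
mask = confidence >= 0.9

print(pseudo_labels)  # [1 2 1 0 2 1 0 0]
print(mask.sum(), "of", len(mask), "rows pass the threshold")  # 7 of 8 rows pass the threshold
```

Only the third row (confidence of about 0.73) falls below the threshold. Note that `len(mask)` is the total number of rows, while `mask.sum()` is the number of rows that pass.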

### Get Features and **Predicted** Labels of samples with **High-Confidence** (both Features & Predicted Labels)

In [23]:
# Get the high-confidence samples (both features and predicted labels)
X_high_confidence = X_unlabeled_scaled_2[high_confidence_mask]
y_high_confidence = y_unlabeled_pred_2[high_confidence_mask]

print('Features & Labels of the Samples with High-Confidence') ### 104 Samples have been predicted with Confidence Higher than 0.9
print('*'*50)
print(X_high_confidence.shape)
print(X_high_confidence[50:58])

print('*'*50)
print(y_high_confidence.shape)
print(y_high_confidence[50:58])

Features & Labels of the Samples with High-Confidence
**************************************************
(104, 4)
[[-1.00812115  1.02425342 -1.65666304 -1.60410809]
 [-1.25910566  1.27407133 -1.71771942 -1.45915856]
 [-1.13361341  2.02352505 -1.35138116 -1.31420904]
 [-1.25910566 -2.47319728 -0.37447912 -0.44451188]
 [ 0.3722937  -0.72447193  0.48031017  0.71508433]
 [-1.51009018  1.02425342 -1.35138116 -1.60410809]
 [-1.25910566  0.02498179 -1.53455029 -1.60410809]
 [-1.13361341  0.77443551 -1.47349391 -1.16925951]]
**************************************************
(104,)
[0 0 0 1 2 0 0 0]


### Include **High-Confidence** samples into **Labeled** Dataset to **Retrain** the model - Keep **Repeating the process** until no new High-Confidence samples are found or Model Accuracy stops improving

In [24]:
# Add the high-confidence samples to the labeled dataset
X_combined_2 = np.concatenate([X_labeled_scaled_2, X_high_confidence], axis=0)
y_combined_2 = np.concatenate([y_labeled, y_high_confidence], axis=0)

print('Shape of X_combined: ', X_combined_2.shape)
print('Shape of y_combined: ', y_combined_2.shape)

Shape of X_combined:  (134, 4)
Shape of y_combined:  (134,)


### **Retrain** Model with Combined dataset of **Labeled** & above **pseudo-Labeled** (High-Confidence Samples)

In [25]:
# Retrain the model with the updated labeled dataset (original + pseudo-labeled data)
mlp_demo2.fit(X_combined_2, y_combined_2)

### Model 2 Evaluation / Accuracy Metric

In [26]:
# Evaluate the retrained model on the combined data (labeled + pseudo-labeled)
y_pred_2 = mlp_demo2.predict(X_combined_2)
y_pred_2

array([1, 0, 1, 1, 0, 1, 2, 2, 0, 1, 2, 2, 0, 2, 0, 1, 2, 2, 1, 2, 1, 1,
       2, 2, 0, 1, 2, 0, 1, 2, 1, 0, 2, 1, 0, 1, 2, 1, 2, 0, 0, 0, 0, 2,
       1, 1, 2, 0, 2, 0, 2, 2, 2, 2, 2, 0, 0, 0, 0, 1, 0, 0, 2, 1, 0, 0,
       0, 2, 1, 0, 0, 1, 2, 1, 2, 1, 2, 0, 2, 1, 0, 0, 0, 1, 2, 0, 0, 0,
       1, 0, 1, 2, 0, 1, 2, 0, 2, 1, 2, 1, 0, 1, 2, 0, 0, 1, 0, 2, 0, 0,
       1, 2, 2, 1, 0, 0, 2, 0, 0, 0, 2, 0, 2, 2, 0, 1, 1, 1, 2, 0, 2, 1,
       2, 1])

In [27]:
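# Note: y_combined_2 contains pseudo-labels produced by this same model, so this is a
# training-set agreement score, not accuracy against the true Iris labels.
accuracy_2 = round((accuracy_score(y_combined_2, y_pred_2))*100,2)
print(f"Accuracy on combined (labeled + pseudo-labeled) data: {accuracy_2:.4f}")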

Accuracy on combined (labeled + pseudo-labeled) data: 100.0000


## Optionally, **Repeat** the Process to Further Improve the Model Performance/Accuracy
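
One way to repeat the process is a loop that moves high-confidence samples out of the unlabeled pool on each round and stops when nothing new qualifies. This is a minimal sketch under stated assumptions, not the notebook's own implementation: the round limit of 5, the names `X_pool` and `X_eval`, and the evaluation against held-back true labels are all additions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_lab, X_pool, y_lab, y_pool_true = train_test_split(X, y, test_size=0.8, random_state=42)

scaler = StandardScaler().fit(X_lab)
X_train, y_train = scaler.transform(X_lab), y_lab.copy()
X_pool = scaler.transform(X_pool)
X_eval = X_pool.copy()  # y_pool_true is used for evaluation only, never for training

threshold, max_rounds = 0.9, 5  # max_rounds is an assumed safety limit
model = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=42)

for round_no in range(1, max_rounds + 1):
    model.fit(X_train, y_train)
    if len(X_pool) == 0:
        break
    proba = model.predict_proba(X_pool)
    confident = proba.max(axis=1) >= threshold
    if not confident.any():
        break
    # Move confident samples, with their predicted labels, into the training set
    new_labels = model.classes_[proba.argmax(axis=1)[confident]]
    X_train = np.concatenate([X_train, X_pool[confident]])
    y_train = np.concatenate([y_train, new_labels])
    X_pool = X_pool[~confident]
    print(f"round {round_no}: added {confident.sum()}, pool left {len(X_pool)}")

model.fit(X_train, y_train)  # final fit so the last additions are included
accuracy = np.mean(model.predict(X_eval) == y_pool_true)
print(f"Accuracy on the originally unlabeled samples: {accuracy:.3f}")
```

Because pseudo-labels can be wrong, each round risks reinforcing earlier mistakes; scoring against held-back true labels, as done here, is one way to notice when that happens.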