# <span style="color:#3665af">Semi-Supervised Learning </span>
<hr>

###### Goal
In this assignment, we will explore the concepts and techniques of semi-supervised learning.

###### Prerequisites
This assignment has the following dependencies:
- Jupyter Notebook, along with the following libraries (which should be installed on the Computing Platform):
  - Scikit-learn
  - NumPy
  - TensorFlow (Keras), used for the CNN classifier
  - os (part of the Python standard library)

Let's dive into the world of semi-supervised learning!

<div style="font-size:30px;color:#3665af;background-color:#E9E9F5;padding:10px;">Assignment Hands-on</div>

<div style="font-size:20px;color:#F1F8FC;background-color:#0095EA;padding:10px;"> Import libraries </div>

In [1]:
# importing libraries
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import os
from sklearn.ensemble import RandomForestClassifier
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Dense,
                                     Flatten,
                                     Dropout,
                                     BatchNormalization,
                                     Conv2D,
                                     MaxPooling2D,)
from tensorflow.keras.regularizers import l2


<div style="font-size:20px;color:#F1F8FC;background-color:#0095EA;padding:10px;"> Learn more about the data </div>


In this assignment, we work with satellite data that captures the same Arctic region from two different views. Together, the two data types offer valuable insight into the various sea ice types, which helps improve navigation in the Arctic.

1. **Sentinel-1 Data (view 1):** We use Synthetic Aperture Radar (SAR) satellite images from the Sentinel-1 mission. SAR images are very useful for creating sea ice charts in the Arctic. SAR works by sending radar signals to the Earth's surface and capturing the signals that bounce back. The specific view we use is the Sentinel-1 image captured in HH polarization, which helps us understand the characteristics of the sea ice in the Arctic. To learn more, follow this [link](https://en.wikipedia.org/wiki/Sentinel-1).

2. **AMSR2 Data (view 2):** Alongside each Sentinel-1 image, we have corresponding data from the Advanced Microwave Scanning Radiometer 2 (AMSR2). This dataset contains the brightness temperatures of the Earth's surface. AMSR2 measures microwave radiation, which can be used to derive surface properties such as sea ice concentration. To learn more, follow this [link](https://www.ospo.noaa.gov/Products/atmosphere/gpds/about_amsr2.html).

To analyze these datasets further, we selected 10 files and divided the images into smaller patches of 32 by 32 pixels each. These patches allow us to focus on specific areas of interest within the Arctic and study them in detail. By combining the information from both Sentinel-1 and AMSR2 data, we can gain a comprehensive understanding of the Arctic environment and its sea ice patterns, which is crucial for various scientific and practical applications, including safe navigation in this challenging region.
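The patching step can be illustrated with a minimal NumPy sketch. It assumes a single-channel image, non-overlapping patches, and edges that are trimmed when they do not fill a whole patch. The helper name `extract_patches` and the toy image size are hypothetical and are not taken from the preprocessing used to build the assignment data.

```python
import numpy as np

def extract_patches(image, patch_size=32):
    """Cut a 2-D image into non-overlapping square patches (hypothetical helper)."""
    rows = image.shape[0] // patch_size
    cols = image.shape[1] // patch_size
    # Trim the edges that do not fill a whole patch
    trimmed = image[:rows * patch_size, :cols * patch_size]
    # Split both axes into (block index, offset within block), then group the blocks
    blocks = trimmed.reshape(rows, patch_size, cols, patch_size).swapaxes(1, 2)
    return blocks.reshape(rows * cols, patch_size, patch_size)

toy_image = np.arange(100 * 70, dtype=np.float32).reshape(100, 70)
patches = extract_patches(toy_image)
print(patches.shape)  # (6, 32, 32)
```

The resulting array has the same `(n_patches, 32, 32)` layout as the view arrays loaded later in this notebook.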

view 1: Sentinel-1 image

<img alt="nersc_sar_primary view" src="nersc_sar_primary.jpg"/>

view 2: AMSR2 image

<img src="btemp_89_0h.jpg" alt = "btemp_89_0h view" >

Download the data.zip file from Canvas, and then execute the cell below to import the data. You can customize the directory name for the data if necessary.

In [2]:
def load_data_from_directories(view1_dir, view2_dir, labels_dir):
    """
    Load data from directories containing two views and corresponding labels.

    Parameters:
    - view1_dir (str): Path to the directory containing view 1 data files.
    - view2_dir (str): Path to the directory containing view 2 data files.
    - labels_dir (str): Path to the directory containing label data files.

    Returns:
    - view1_data (numpy.ndarray): NumPy array containing data from view 1.
    - view2_data (numpy.ndarray): NumPy array containing data from view 2.
    - labels_data (numpy.ndarray): NumPy array containing label data.

    This function loads data from two views and their corresponding labels, assuming a common "number" part
    in the file names for matching files. It ensures that data files from both views and labels are consistent
    and loads them into NumPy arrays for further processing.
    """
    # List all files in each directory
    files_view1 = os.listdir(view1_dir)
    files_view2 = os.listdir(view2_dir)
    files_label = os.listdir(labels_dir)

    # Initialize empty lists to store data from each view and labels
    view1_data = []
    view2_data = []
    labels_data = []

    # Iterate through the files in the directory
    for filename in files_view1:
        if filename.endswith('_samples_view1.npy'):
            # Extract the common "number" part of the file name
            common_number = filename.split('_')[0]

            # Check if corresponding files exist for view2 and labels
            if common_number + '_samples_view2.npy' in files_view2 and common_number + '_labels.npy' in files_label:
                # Load data from the NumPy files
                data_view1 = np.load(os.path.join(view1_dir, filename))
                data_view2 = np.load(os.path.join(view2_dir, common_number + '_samples_view2.npy'))
                data_labels = np.load(os.path.join(labels_dir, common_number + '_labels.npy'))

                # Append data to respective lists
                view1_data.append(data_view1)
                view2_data.append(data_view2)
                labels_data.append(data_labels)

    view1_data = np.array(view1_data)
    view2_data = np.array(view2_data)
    labels_data = np.array(labels_data)

    return view1_data, view2_data, labels_data


view1_dir = 'C:/Users/srina/Documents/Big Data science/data/view1/' 
view2_dir = 'C:/Users/srina/Documents/Big Data science/data/view2/' 
labels_dir = 'C:/Users/srina/Documents/Big Data science/data/labels/' 

view1_data, view2_data, labels_data = load_data_from_directories(view1_dir, view2_dir, labels_dir)


print(" shape view 1 data: ", view1_data.shape)
print(" shape view 2 data: ", view2_data.shape)
print(" shape labels data: ", labels_data.shape)


 shape view 1 data:  (13683, 32, 32)
 shape view 2 data:  (13683, 32, 32)
 shape labels data:  (13683, 1)


<div style="font-size:20px;color:#F1F8FC;background-color:#0095EA;padding:10px;">Part  1. Co-Training Models for Sea Ice Classification</div>

In this task, you'll be applying the co-training technique to the dataset. Through this, you'll observe the outcomes of both semi-supervised learning and supervised learning when there's only a limited amount of labeled data available. You may want to revisit Lecture 02, which covers co-training, for a more comprehensive understanding.

1-1. Divide the dataset into three distinct sets: one for labeled data, one for unlabeled data, and one for test data. Make sure that the labeled dataset contains between 100 and 130 data points.

In [3]:
# Please write your code here. Comments are provided for guidance purposes. make adjustments as needed.

labeled_set = {}
unlabeled_set = {}
test_set = {}

def split_dataset(dataset_view1, dataset_view2, labeled_size=120, test_size=0.2, random_seed=42):
    """
    Split the dataset into labeled, unlabeled, and test sets.

    Parameters:
    - dataset_view1 (numpy.ndarray): Samples from view 1.
    - dataset_view2 (numpy.ndarray): Samples from view 2, aligned with view 1.
    - labeled_size (int): The target size for the labeled set (default: 120).
    - test_size (float): The proportion of the dataset to include in the test split (default: 0.2).
    - random_seed (int): Seed for reproducibility (default: 42).

    Note: the labels are read from the notebook-level `labels_data` array.

    Returns:
    - labeled_set_view1: View 1 samples of the labeled set (approximately 100-130 points).
    - labeled_set_view2: View 2 samples of the labeled set (approximately 100-130 points).
    - label_labeled_set: Labels corresponding to the labeled data points.
    - unlabeled_set (dict): Unlabeled subset with keys "view1", "view2", and "labels".
    - test_set (dict): Test subset with keys "view1", "view2", and "labels".
    - label_test_set: Labels corresponding to the test data points.
    """
    np.random.seed(random_seed)
    n_samples = len(dataset_view1)
    indices = np.random.permutation(n_samples)
    
    n_labeled = int(labeled_size)
    n_test = int(n_samples * test_size)

    end_labeled = n_labeled
    start_test = end_labeled
    end_test = start_test + n_test
    
    labeled_indices = indices[:end_labeled]
    test_indices = indices[start_test:end_test]
    unlabeled_indices = indices[end_test:]

    datasets = {
        "view1": dataset_view1,
        "view2": dataset_view2,
        "labels": labels_data  
    }

    def extract(subset_indices):
        return {key: value[subset_indices] for key, value in datasets.items()}

    labeled_set = extract(labeled_indices)
    test_set = extract(test_indices)
    unlabeled_set = {key: value[unlabeled_indices] for key, value in datasets.items()}
    
    labeled_set_view1 = labeled_set["view1"]
    labeled_set_view2 = labeled_set["view2"]
    label_labeled_set = labeled_set["labels"]
    label_test_set = test_set["labels"]

    return (
        labeled_set_view1, labeled_set_view2, label_labeled_set,
        unlabeled_set,
        test_set, test_set["labels"]
    )
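As an alternative to the permutation-based split, the same three-way split could be sketched with `train_test_split`, which is imported at the top of the notebook. This is only a sketch under assumptions: the labels are synthetic, and stratification (keeping class proportions similar in each set) is an addition that the assignment does not require.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
toy_labels = rng.integers(0, 6, size=1000)   # synthetic labels for 6 classes
indices = np.arange(len(toy_labels))

# First hold out 20% for testing, keeping the class proportions
rest_idx, test_idx = train_test_split(
    indices, test_size=0.2, stratify=toy_labels, random_state=42)

# Then take 120 labeled samples from the remainder; the rest stays unlabeled
unlabeled_idx, labeled_idx = train_test_split(
    rest_idx, test_size=120, stratify=toy_labels[rest_idx], random_state=42)

print(len(labeled_idx), len(unlabeled_idx), len(test_idx))  # 120 680 200
```

Working with index arrays keeps both views and the labels aligned, because the same indices can be applied to each array.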



1-2. Initialize two classifiers, one for each view, using scikit-learn and Keras. Consider using a Convolutional Neural Network (CNN) as one of the classifiers and a Random Forest as the other.
Here is a short description of the configuration of the CNN (Convolutional Neural Network) and Random Forest (RF) classifiers to implement:

**CNN Classifier Configuration:**

1. Input Layer: BatchNormalization with input shape (32, 32, 1).
2. Convolutional Layer 1: 32 filters, each with a 3x3 kernel and ReLU activation.
3. Max Pooling Layer 1: 2x2 pooling with a stride of 2.
4. Convolutional Layer 2: 32 filters, each with a 3x3 kernel and ReLU activation.
5. Max Pooling Layer 2: 2x2 pooling with a stride of 2.
6. Convolutional Layer 3: 32 filters, each with a 3x3 kernel and ReLU activation.
7. Max Pooling Layer 3: 2x2 pooling with a stride of 2.
8. BatchNormalization Layer.
9. Flatten Layer.
10. Dropout Layer with a dropout rate of 0.1.
11. Fully Connected Layer 1: 16 neurons, ReLU activation, and L2 regularization with a weight decay of 0.001.
12. Dropout Layer with a dropout rate of 0.1.
13. Fully Connected Layer 2: 16 neurons, ReLU activation, and L2 regularization with a weight decay of 0.001.
14. Dropout Layer with a dropout rate of 0.1.
15. Output Layer: Dense layer with the number of neurons equal to the number of classes and softmax activation.

**CNN Model Compilation:**
- Optimizer: Adam
- Loss Function: Sparse Categorical Crossentropy
- Metrics for Evaluation: Accuracy

**Random Forest Classifier Configuration:**
- Number of Estimators: 20
- Random State: 42
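To see what the CNN configuration above implies for the tensor sizes, the following sketch traces the spatial size through the three convolution and pooling blocks. It assumes the Keras defaults of `padding='valid'` and a convolution stride of 1; the function name `trace_cnn_shapes` is a hypothetical helper, not part of the assignment.

```python
def trace_cnn_shapes(input_size=32, n_blocks=3, filters=32, kernel=3, pool=2):
    """Return the (height, width, channels) after each conv + max-pool block."""
    size = input_size
    shapes = []
    for _ in range(n_blocks):
        size = size - kernel + 1          # 'valid' convolution, stride 1
        size = (size - pool) // pool + 1  # max pooling with stride equal to the pool size
        shapes.append((size, size, filters))
    return shapes

shapes = trace_cnn_shapes()
height, width, channels = shapes[-1]
flat_features = height * width * channels
print(shapes)         # [(15, 15, 32), (6, 6, 32), (2, 2, 32)]
print(flat_features)  # 128
```

Under these assumptions, the Flatten layer hands 128 features to the first fully connected layer.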




In [4]:
num_classes = 6
# write your code here
#implement classifiers based on the provided definition
#cnn_classifier =
#rf_classifier =

def create_cnn_classifier(num_classes):
    model = Sequential()
    model.add(BatchNormalization(input_shape=(32, 32, 1)))
    model.add(Conv2D(32, (3, 3), activation='relu'))
    model.add(MaxPooling2D((2, 2), strides=2))
    model.add(Conv2D(32, (3, 3), activation='relu'))
    model.add(MaxPooling2D((2, 2), strides=2))
    model.add(Conv2D(32, (3, 3), activation='relu'))
    model.add(MaxPooling2D((2, 2), strides=2))
    model.add(BatchNormalization())
    model.add(Flatten())
    model.add(Dropout(0.1))
    model.add(Dense(16, activation='relu', kernel_regularizer=l2(0.001)))
    model.add(Dropout(0.1))
    model.add(Dense(16, activation='relu', kernel_regularizer=l2(0.001)))
    model.add(Dropout(0.1))
    model.add(Dense(num_classes, activation='softmax'))
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model

def create_rf_classifier(num_estimators, random_state):
    rf_clf = RandomForestClassifier(n_estimators=num_estimators, random_state=random_state)
    return rf_clf




1-3. Do the co-training part:
   - Train the classifiers on the labeled data.
   - Predict on the unlabeled data and identify instances with a confidence score above 90% (0.90).
   - Add the confident instances to the labeled set and train again.
   - Compute the accuracy of the classifiers on the test set.

For a better understanding of the accuracy measure, please refer to the following [link](https://en.wikipedia.org/wiki/Accuracy_and_precision).

<img src="accuracy.jpg" alt="accuracy metric">
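The confidence-selection step can be sketched on its own with NumPy. The example assumes each classifier produces one row of class probabilities per unlabeled sample; the helper name `select_confident` and the toy probabilities are made up for illustration.

```python
import numpy as np

def select_confident(probabilities, threshold=0.90):
    """Return indices and pseudo-labels of rows whose top probability exceeds the threshold."""
    confidences = probabilities.max(axis=1)
    pseudo_labels = probabilities.argmax(axis=1)
    confident_idx = np.where(confidences > threshold)[0]
    return confident_idx, pseudo_labels[confident_idx]

toy_proba = np.array([
    [0.95, 0.03, 0.02],
    [0.40, 0.35, 0.25],
    [0.05, 0.92, 0.03],
    [0.10, 0.10, 0.80],
])
idx, labels = select_confident(toy_proba, threshold=0.90)
print(idx, labels)  # [0 2] [0 1]
```

In the classic formulation of co-training, the samples that one view's classifier labels confidently are typically added to the training data of the other view's classifier, so that each view teaches the other. Whether to exchange pseudo-labels this way or keep them per view is a design choice to state explicitly in your solution.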

In [5]:
 # Please write your code here. Comments are provided for guidance purposes. make adjustments as needed.
def co_training(classifier1, classifier2, labeled_set, unlabeled_set, test_set , threshold_confidence):
    """
    Perform co-training with two classifiers on labeled and unlabeled data.

    Parameters:
    - classifier1: The first classifier, a compiled Keras model (e.g., CNN) trained on view 1.
    - classifier2: The second classifier, a scikit-learn model (e.g., Random Forest) trained on view 2.
    - labeled_set (dict): Labeled data with keys "view1", "view2", and "labels".
    - unlabeled_set (dict): Unlabeled data with keys "view1" and "view2".
    - test_set (dict): Test data with keys "view1", "view2", and "labels".
    - threshold_confidence (float): The minimum confidence threshold for adding unlabeled samples to the training set.

    Returns:
    - classifier1_accuracy (list): Keras `evaluate` output for classifier 1 on the test set, i.e. [loss, accuracy].
    - classifier2_accuracy (float): Accuracy of classifier 2 on the test set after co-training.
    """
    # Flatten each view-2 patch into a feature vector for the Random Forest
    labeled_set_2_2D = labeled_set["view2"].reshape((len(labeled_set["view2"]), -1))
    unlabeled_set_2_2D = unlabeled_set["view2"].reshape((len(unlabeled_set["view2"]), -1))
    test_set_2_2D = test_set["view2"].reshape((len(test_set["view2"]), -1))

    # Use 1-D label arrays so scikit-learn does not warn about column vectors
    labels = labeled_set["labels"].ravel()
    test_labels = test_set["labels"].ravel()

    classifier1.fit(labeled_set["view1"], labels)
    classifier2.fit(labeled_set_2_2D, labels)

    # Keras `predict` returns class probabilities, so the predicted class is the argmax
    probabilities1 = classifier1.predict(unlabeled_set["view1"])
    predictions1 = np.argmax(probabilities1, axis=1)
    predictions2 = classifier2.predict(unlabeled_set_2_2D)

    confidences1 = np.max(probabilities1, axis=1)
    confidences2 = np.max(classifier2.predict_proba(unlabeled_set_2_2D), axis=1)

    confident_samples_1 = np.where(confidences1 > threshold_confidence)[0]
    confident_samples_2 = np.where(confidences2 > threshold_confidence)[0]

    # Extend each classifier's training set with its confident pseudo-labeled samples
    updated_labeled_set_1 = np.vstack((labeled_set["view1"], unlabeled_set["view1"][confident_samples_1]))
    updated_label_labeled_set_1 = np.append(labels, predictions1[confident_samples_1])

    updated_labeled_set_2_2D = np.vstack((labeled_set_2_2D, unlabeled_set_2_2D[confident_samples_2]))
    updated_label_labeled_set_2 = np.append(labels, predictions2[confident_samples_2])

    classifier1.fit(updated_labeled_set_1, updated_label_labeled_set_1)
    classifier2.fit(updated_labeled_set_2_2D, updated_label_labeled_set_2)

    # Compute accuracy on the test set (Keras `evaluate` returns [loss, accuracy])
    classifier1_accuracy = classifier1.evaluate(test_set["view1"], test_labels)
    classifier2_accuracy = classifier2.score(test_set_2_2D, test_labels)

    return classifier1_accuracy, classifier2_accuracy

1-4. Pick one of the classifiers, perform supervised training with the labeled data, and calculate the accuracy.

In [6]:
# Please write your code here. Comments are provided for guidance purposes. make adjustments as needed.
def supervised_training_and_accuracy(classifier, labeled_data, labeled_labels, test_data, test_labels):
    """
    Perform supervised training with a classifier on the labeled data and calculate the accuracy on test data.

    Parameters:
    - classifier: The classifier to be used for supervised training (e.g., Random Forest).
    - labeled_data (array-like): Labeled training data.
    - labeled_labels (array-like): Labels for the labeled training data.
    - test_data (array-like): Test data for evaluation.
    - test_labels (array-like): Labels for the test data.

    Returns:
    - accuracy (float): Accuracy of the classifier on the test data after supervised training.
    """
    samples_1, x1, y1 = labeled_data.shape
    labeled_set_2_2D = labeled_data.reshape((samples_1,x1*y1))
    
    samples_3, x3, y3 = test_data.shape
    test_set_2_2D = test_data.reshape((samples_3,x3*y3))
    
    classifier.fit(labeled_set_2_2D, labeled_labels)
    accuracy = classifier.score(test_set_2_2D, test_labels)
    return accuracy

1-5. Compare the accuracy of the co-training approach with that of the supervised model trained on limited labeled data, and explain your reasoning.

In [7]:
# your code and answer here
labeled_set["view1"], labeled_set["view2"], labeled_set["labels"], unlabeled_set, test_set, test_set["labels"] = split_dataset(view1_data, view2_data, labeled_size= 120)

cnn_classifier = create_cnn_classifier(num_classes)

rf_classifier = create_rf_classifier(num_estimators=20, random_state=42)

cnn_accuracy, rf_accuracy = co_training(cnn_classifier , rf_classifier, 
                                        labeled_set,
                                        unlabeled_set,
                                        test_set,
                                        threshold_confidence= 0.90)

rf_supervised_accuracy = supervised_training_and_accuracy(rf_classifier, labeled_set["view2"], labeled_set["labels"], test_set["view2"], test_set["labels"])

print(f"Co-training CNN Accuracy: {cnn_accuracy}")
print(f"Co-training Random Forest Accuracy: {rf_accuracy}")
print(f"Supervised Random Forest Accuracy: {rf_supervised_accuracy}")

"""
 Answer
 Here we compare the co-training and supervised learning approaches. The results show that, for the Random Forest,
 co-training gives better results than supervised learning. Co-training uses both a CNN and a Random Forest, each on a different view of the data,
 while the supervised baseline uses only the Random Forest. Comparing the Random Forest accuracies alone, the co-trained Random Forest (about 0.894) does better than the supervised one (about 0.864)
 because co-training also uses the unlabeled data for training.
""" 





Co-training CNN Accuracy: [1.7494062185287476, 0.6165935397148132]
Co-training Random Forest Accuracy: 0.893640350877193
Supervised Random Forest Accuracy: 0.8636695906432749
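The CNN entry above is a two-element list because Keras `evaluate` returns the loss followed by the compiled metrics, so the second value (about 0.617) is the accuracy. The short sketch below unpacks that pair and converts the two Random Forest accuracies into counts of correctly classified test patches. It assumes the test set size follows from `test_size=0.2` applied to the 13683 loaded patches.

```python
n_samples = 13683                 # number of patches reported when loading the data
n_test = int(n_samples * 0.2)     # test size used by split_dataset

cnn_loss, cnn_accuracy = [1.7494062185287476, 0.6165935397148132]
rf_cotraining = 0.893640350877193
rf_supervised = 0.8636695906432749

# Convert accuracies to numbers of correctly classified test patches
rf_cotraining_correct = round(rf_cotraining * n_test)
rf_supervised_correct = round(rf_supervised * n_test)
print(n_test, rf_cotraining_correct, rf_supervised_correct)  # 2736 2445 2363
print(rf_cotraining_correct - rf_supervised_correct)         # 82
```

Seen this way, the gap of about three percentage points corresponds to 82 additional test patches classified correctly by the co-trained Random Forest.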



<div style="font-size:20px;color:#F1F8FC;background-color:#0095EA;padding:10px;">Part 2. Label Propagation for Sea Ice Classification</div>


In this task, you'll be applying the label propagation technique to the dataset. Through this, you'll observe the outcomes of both semi-supervised learning and supervised learning when there's only a limited amount of labeled data available.


<img src="label_propagation.jpg" alt="label propagation process">

2-1. Apply the K-Nearest Neighbors (KNN) algorithm with n_neighbors set to 7 for the label propagation model. Use one of the labeled data views and the corresponding unlabeled data from part 1 as input.

For a better understanding of the accuracy measure, please refer to the following [link](https://en.wikipedia.org/wiki/Accuracy_and_precision).

In [8]:
from sklearn.semi_supervised import LabelPropagation
from sklearn.neighbors import KNeighborsClassifier

# Please write your code here. Comments are provided for guidance purposes. make adjustments as needed.
def label_propagation(labeled_data, unlabeled_data, labeled_labels, test_data, label_test, n_neighbors=7):
    """
    Apply K-Nearest Neighbors (KNN) to the label propagation model on one data view and test data.

    Parameters:
    - labeled_data (array-like): Labeled data points.
    - unlabeled_data (array-like): Unlabeled data points.
    - labeled_labels (array-like): Labels corresponding to the labeled data points.
    - test_data (array-like): Test data to evaluate label propagation performance.
    - label_test (array-like): Labels corresponding to the test data points.
    - n_neighbors (int): Number of neighbors to consider in KNN (default: 7).

    Returns:
    - accuracy (float): Accuracy of label propagation on the test data.
    """
    samples_1, x1, y1 = labeled_data.shape
    labeled_set_1_2D = labeled_data.reshape((samples_1,x1*y1))
    
    samples_2, x2, y2 = unlabeled_data.shape
    unlabeled_set_1_2D = unlabeled_data.reshape((samples_2,x2*y2))
    
    samples_3, x3, y3 = test_data.shape
    test_set_1_2D = test_data.reshape((samples_3,x3*y3))
    
    label_prop_model = LabelPropagation(kernel="knn", n_neighbors=n_neighbors, max_iter=1000)
    
    # Pass 1-D labels so scikit-learn does not warn about a column vector
    label_prop_model.fit(labeled_set_1_2D, labeled_labels.ravel())
    
    predicted_labels = label_prop_model.predict(unlabeled_set_1_2D)
    
    data_combined = np.vstack([labeled_set_1_2D, unlabeled_set_1_2D])
    
    all_labels = np.concatenate([labeled_labels.flatten(), predicted_labels])
    
    knn_classifier = KNeighborsClassifier(n_neighbors=n_neighbors)

    knn_classifier.fit(data_combined, all_labels)

    test_predicted_labels = knn_classifier.predict(test_set_1_2D)
    
    accuracy = accuracy_score(label_test, test_predicted_labels)
    
    return accuracy

label_prop_accuracy = label_propagation(labeled_set["view1"], unlabeled_set["view1"], labeled_set["labels"], test_set["view1"], test_set["labels"], n_neighbors=7)

print(f"KNN Accuracy for label propagation: {label_prop_accuracy}")


KNN Accuracy for label propagation: 0.8088450292397661


2-2. Select a classification algorithm and perform supervised learning on the labeled set. Then, evaluate the model's performance by calculating the accuracy. You can use a built-in library for the classifier. Compare your supervised and semi-supervised accuracies.

In [9]:
# Please write your code here. Comments are provided for guidance purposes. make adjustments as needed.
def supervised_training_and_accuracy(classifier, labeled_data, labeled_labels, test_data, test_labels):
    """
    Perform supervised training with a classifier on the labeled data and calculate the accuracy on test data.

    Parameters:
    - classifier: The classifier to be used for supervised training (e.g., Random Forest).
    - labeled_data (array-like): Labeled training data.
    - labeled_labels (array-like): Labels for the labeled training data.
    - test_data (array-like): Test data for evaluation.
    - test_labels (array-like): Labels for the test data.

    Returns:
    - accuracy (float): Accuracy of the classifier on the test data after supervised training.
    """
    
    samples_1, x1, y1 = labeled_data.shape
    labeled_set_1_2D = labeled_data.reshape((samples_1,x1*y1))
    
    samples_2, x2, y2 = test_data.shape
    test_data_1_2D = test_data.reshape((samples_2,x2*y2))
    
    classifier.fit(labeled_set_1_2D, labeled_labels)
    
    predictions = classifier.predict(test_data_1_2D)
    
    accuracy = accuracy_score(test_labels, predictions)
    
    return accuracy

rnd_f_accuracy = supervised_training_and_accuracy(rf_classifier, labeled_set["view1"], labeled_set["labels"], test_set["view1"], test_set["labels"])

print(f"Supervised training accuracy: {rnd_f_accuracy}")

"""
Answer:
Comparing the accuracies of the semi-supervised and supervised approaches, supervised learning gives the better result here
because every sample it trains on carries a true label, while the semi-supervised model also trains on labels that it predicted
for the unlabeled data, and those predictions can be wrong. Even so, the semi-supervised accuracy is close to the supervised one, which is notable
because it reduces the labeling effort and makes more data available for training the model.
"""

Supervised training accuracy: 0.8636695906432749



<div style="font-size:20px;color:#F1F8FC;background-color:#0095EA;padding:10px;">Part 3. Now let's perform some experimentation and make some observations!</div>



3-1. We will explore the impact of varying the confidence threshold in the co-training process at three different values: 80%, 70%, and 60% (0.80, 0.70, and 0.60). We will then assess the accuracy of co-training for each of these threshold settings.

For a better understanding of the accuracy measure, please refer to the following [link](https://en.wikipedia.org/wiki/Accuracy_and_precision).
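Before running the experiment, it can help to see how the threshold changes the number of pseudo-labeled samples. The sketch below assumes a Random Forest with 20 trees in which every tree casts a hard vote (typical of fully grown trees), so confidences come in steps of 0.05. The vote counts are hypothetical.

```python
import numpy as np

n_trees = 20
# Hypothetical number of trees (out of 20) voting for the winning class of each sample
winning_votes = np.array([20, 19, 18, 17, 16, 15, 14, 13, 12, 11])
confidences = winning_votes / n_trees

for threshold in (0.90, 0.80, 0.70, 0.60):
    n_selected = int(np.sum(confidences > threshold))
    print(f"threshold={threshold:.2f}: {n_selected} of {len(confidences)} samples selected")
```

Because the comparison is strict, a sample whose confidence equals the threshold exactly is not selected. Lowering the threshold admits more samples, but their pseudo-labels are less reliable.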

In [10]:
# your code here
labeled_set["view1"], labeled_set["view2"], labeled_set["labels"], unlabeled_set, test_set, test_set["labels"] = split_dataset(view1_data, view2_data, labeled_size= 120)

cnn_accuracy_1, rf_accuracy_1 = co_training(cnn_classifier , rf_classifier, 
                                        labeled_set,
                                        unlabeled_set,
                                        test_set,
                                        threshold_confidence= 0.80)

labeled_set["view1"], labeled_set["view2"], labeled_set["labels"], unlabeled_set, test_set, test_set["labels"] = split_dataset(view1_data, view2_data, labeled_size= 120)


cnn_accuracy_2, rf_accuracy_2 = co_training(cnn_classifier , rf_classifier, 
                                        labeled_set,
                                        unlabeled_set,
                                        test_set,
                                        threshold_confidence= 0.70)

labeled_set["view1"], labeled_set["view2"], labeled_set["labels"], unlabeled_set, test_set, test_set["labels"] = split_dataset(view1_data, view2_data, labeled_size= 120)


cnn_accuracy_3, rf_accuracy_3 = co_training(cnn_classifier , rf_classifier, 
                                        labeled_set,
                                        unlabeled_set,
                                        test_set,
                                        threshold_confidence= 0.60)







In [11]:
print(f"Co-training of CNN Accuracy: {cnn_accuracy_1} and Random Forest Accuracy: {rf_accuracy_1}")
print(f"Co-training of CNN Accuracy: {cnn_accuracy_2} and Random Forest Accuracy: {rf_accuracy_2}")
print(f"Co-training of CNN Accuracy: {cnn_accuracy_3} and Random Forest Accuracy: {rf_accuracy_3}")

Co-training of CNN Accuracy: [1.73076593875885, 0.6165935397148132] and Random Forest Accuracy: 0.893640350877193
Co-training of CNN Accuracy: [1.6942936182022095, 0.6165935397148132] and Random Forest Accuracy: 0.893640350877193
Co-training of CNN Accuracy: [1.6920809745788574, 0.6165935397148132] and Random Forest Accuracy: 0.893640350877193


3-2. Change the n_neighbors parameter of the K-Nearest Neighbors (KNN) algorithm for Label Propagation (part 2) to the values 3, 5, and 10, and explain what you understand about these parameter adjustments.
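The effect of the neighbor count can be illustrated with a tiny majority-vote example. This is a toy sketch on made-up one-dimensional data, not the label propagation model itself; the helper `knn_predict` is hypothetical.

```python
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Majority vote among the k training points closest to the query (1-D toy data)."""
    order = sorted(range(len(train_points)), key=lambda i: abs(train_points[i] - query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy data: a small group of class "A" next to a larger group of class "B"
points = [1.0, 1.2, 1.4, 3.0, 3.2, 3.4, 3.6, 3.8, 4.0, 4.2]
labels = ["A", "A", "A", "B", "B", "B", "B", "B", "B", "B"]

for k in (3, 5, 10):
    print(k, knn_predict(points, labels, query=1.3, k=k))
# 3 A
# 5 A
# 10 B
```

With a small k, the query follows its immediate neighborhood. With a large k, the vote reaches into the bigger class and the small group is outvoted, which is one way a larger k can lower accuracy on minority regions.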


In [12]:
# your code here
label_prop_accuracy_1 = label_propagation(labeled_set["view1"], unlabeled_set["view1"], labeled_set["labels"], test_set["view1"], test_set["labels"], n_neighbors=3)
label_prop_accuracy_2 = label_propagation(labeled_set["view1"], unlabeled_set["view1"], labeled_set["labels"], test_set["view1"], test_set["labels"], n_neighbors=5)
label_prop_accuracy_3 = label_propagation(labeled_set["view1"], unlabeled_set["view1"], labeled_set["labels"], test_set["view1"], test_set["labels"], n_neighbors=10)


print(f"KNN Accuracy for label propagation for 3: {label_prop_accuracy_1}")
print(f"KNN Accuracy for label propagation for 5: {label_prop_accuracy_2}")
print(f"KNN Accuracy for label propagation for 10: {label_prop_accuracy_3}")

"""
Answer:
Looking at the outputs for the different values of K, the accuracy drops slightly as K grows (about 0.821 for K = 3 and K = 5, about 0.809 for K = 10).
A larger K averages over more neighbors, which smooths the decision boundary: the model becomes less sensitive to outliers
but can blur the boundaries between classes (higher bias). When the K values are small and close together, the change has
a negligible effect on accuracy, as the results for 3 and 5 show.
"""


KNN Accuracy for label propagation for 3: 0.8212719298245614
KNN Accuracy for label propagation for 5: 0.8212719298245614
KNN Accuracy for label propagation for 10: 0.8088450292397661



3-3. Let's see the impact of simplified models on the co-training approach.
- Reduce the number of convolutional layers in question 1-2 from 3 to 1; the rest of the layers stay the same.
- Change the number of trees for the random forest algorithm to 1.
- Evaluate the performance of the co-training approach.
- Additionally, use the CNN with 1 convolutional layer as the supervised model and evaluate the performance of supervised learning.
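Removing convolution and pooling blocks does not necessarily make the network smaller. The following arithmetic sketch counts the inputs and parameters of the first 16-unit fully connected layer for 3 blocks and for 1 block. It assumes the Keras defaults of `padding='valid'` and a convolution stride of 1; both helper functions are hypothetical.

```python
def flattened_features(n_blocks, input_size=32, filters=32, kernel=3, pool=2):
    """Feature count after n_blocks of (valid 3x3 conv, 2x2 max pool) and Flatten."""
    size = input_size
    for _ in range(n_blocks):
        size = (size - kernel + 1 - pool) // pool + 1
    return size * size * filters

def dense_params(n_inputs, n_units=16):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

for n_blocks in (3, 1):
    features = flattened_features(n_blocks)
    print(n_blocks, features, dense_params(features))
# 3 128 2064
# 1 7200 115216
```

Under these assumptions, the single-block model feeds 7200 features into the first dense layer, so that layer has far more weights to fit from the same small labeled set.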


In [13]:
# your code here
def modified_cnn_classifier(num_classes):
    model = Sequential()
    model.add(BatchNormalization(input_shape=(32, 32, 1)))
    model.add(Conv2D(32, (3, 3), activation='relu'))
    model.add(MaxPooling2D((2, 2), strides=2))
    model.add(BatchNormalization())
    model.add(Flatten())
    model.add(Dropout(0.1))
    model.add(Dense(16, activation='relu', kernel_regularizer=l2(0.001)))
    model.add(Dropout(0.1))
    model.add(Dense(16, activation='relu', kernel_regularizer=l2(0.001)))
    model.add(Dropout(0.1))
    model.add(Dense(num_classes, activation='softmax'))
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model

def modified_rf_classifier(num_estimators, random_state):
    rf_clf = RandomForestClassifier(n_estimators=num_estimators, random_state=random_state)
    return rf_clf

def modified_supervised_training_and_accuracy(classifier, labeled_data, labeled_labels, test_data, test_labels):
    classifier.fit(labeled_data, labeled_labels)
    accuracy = classifier.evaluate(test_data, test_labels)
    return accuracy

cnn_classifier_1 = modified_cnn_classifier(num_classes =6)

rf_classifier_1 = modified_rf_classifier(num_estimators=1, random_state=42)

# Co-training
cnn_accuracy_4, rf_accuracy_4 = co_training(cnn_classifier_1 , rf_classifier_1, 
                                        labeled_set,
                                        unlabeled_set,
                                        test_set,
                                        threshold_confidence= 0.90)

# Supervised training with the one-convolutional-layer CNN.
# Note: cnn_classifier_1 was already passed to co_training above, so this call likely continues training that model rather than starting from scratch.
cnn_supervised_accuracy = modified_supervised_training_and_accuracy(cnn_classifier_1, labeled_set["view2"], labeled_set["labels"], test_set["view2"], test_set["labels"])

# Compare accuracies (the CNN values are [loss, accuracy] pairs, as returned by Keras `evaluate`)
print(f"Co-training CNN Accuracy: {cnn_accuracy_4}")
print(f"Co-training Random Forest Accuracy: {rf_accuracy_4}")
print(f"Supervised CNN Accuracy: {cnn_supervised_accuracy}")


Co-training CNN Accuracy: [1.722702980041504, 0.37061402201652527]
Co-training Random Forest Accuracy: 0.8154239766081871
Supervised CNN Accuracy: [1.4786882400512695, 0.5862573385238647]
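
The CNN results above are two-element lists because Keras `evaluate` returns the loss followed by each compiled metric, while the random forest result is a single accuracy value. A small helper can put both on the same footing before comparing them. This is only a sketch: `extract_accuracy` is a hypothetical name that is not part of the assignment code, and it assumes accuracy is the only compiled metric.

```python
def extract_accuracy(result):
    """Return the accuracy from a Keras-style [loss, accuracy] pair or a plain float.

    Hypothetical helper for illustration; it assumes accuracy is the second
    entry whenever `result` is a list or tuple.
    """
    if isinstance(result, (list, tuple)):
        return float(result[1])
    return float(result)


# Values copied from the output above.
cnn_cotraining = [1.722702980041504, 0.37061402201652527]
rf_cotraining = 0.8154239766081871
cnn_supervised = [1.4786882400512695, 0.5862573385238647]

for name, result in [("Co-training CNN", cnn_cotraining),
                     ("Co-training Random Forest", rf_cotraining),
                     ("Supervised CNN", cnn_supervised)]:
    print(f"{name} accuracy: {extract_accuracy(result):.4f}")
```

Comparing only the accuracy component avoids reading the loss value as if it were an accuracy.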


3-4. Let's adjust the amount of labeled data in part 2 by considering two different quantities: 200 and 400 labeled data points. In each scenario, the rest of the data stays unlabeled. Evaluate the performance of label propagation in each scenario.


In [14]:
# your code here
labeled_set["view1"], labeled_set["view2"], labeled_set["labels"], unlabeled_set, test_set, test_set["labels"] = split_dataset(view1_data, view2_data, labeled_size= 200)
label_prop_accuracy2 = label_propagation(labeled_set["view1"], unlabeled_set["view1"], labeled_set["labels"], test_set["view1"], test_set["labels"], n_neighbors=7)

labeled_set["view1"], labeled_set["view2"], labeled_set["labels"], unlabeled_set, test_set, test_set["labels"] = split_dataset(view1_data, view2_data, labeled_size= 400)
label_prop_accuracy3 = label_propagation(labeled_set["view1"], unlabeled_set["view1"], labeled_set["labels"], test_set["view1"], test_set["labels"], n_neighbors=7)

print(f"KNN Accuracy for label propagation for labeled size of 200: {label_prop_accuracy2}")
print(f"KNN Accuracy for label propagation for labeled size of 400: {label_prop_accuracy3}")


KNN Accuracy for label propagation for labeled size of 200: 0.8870614035087719
KNN Accuracy for label propagation for labeled size of 400: 0.8863304093567251



3-5. Let's adjust the number of labeled data samples for part 1. Consider three scenarios: 200, 400, and 600 labeled samples. In each scenario, the rest of the data stays unlabeled. Additionally, explain your understanding of how these parameter changes impact the algorithm.


In [15]:
# Run the same experiment for each labeled-set size.
# Note: cnn_classifier_1 and rf_classifier_1 (co-training) and rf_classifier (supervised)
# are reused across iterations rather than re-created.
for labeled_size in (200, 400, 600):
    labeled_set["view1"], labeled_set["view2"], labeled_set["labels"], unlabeled_set, test_set, test_set["labels"] = split_dataset(view1_data, view2_data, labeled_size=labeled_size)

    cnn_acc, rf_acc = co_training(cnn_classifier_1, rf_classifier_1,
                                  labeled_set,
                                  unlabeled_set,
                                  test_set,
                                  threshold_confidence=0.90)

    rf_supervised_accu = supervised_training_and_accuracy(rf_classifier, labeled_set["view2"], labeled_set["labels"], test_set["view2"], test_set["labels"])

    print(f"Co-training CNN Accuracy for {labeled_size} labeled: {cnn_acc}")
    print(f"Co-training Random Forest Accuracy for {labeled_size} labeled: {rf_acc}")
    print(f"Supervised Random Forest Accuracy for {labeled_size} labeled: {rf_supervised_accu}")


"""
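Answer:
More labeled data is expected to help, because the classifiers start from more reliable labels.
In these runs, however, the accuracies do not increase steadily with the labeled-set size:
the random forest stays between about 0.88 and 0.89 for both co-training and supervised learning,
and the co-training CNN accuracy goes from about 0.84 (200) to 0.87 (400) and then down to 0.81 (600).
The random forest differences are small and every value comes from a single run, so the trend should be interpreted with caution."""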


Co-training CNN Accuracy for 200 labeled: [1.4388943910598755, 0.8399122953414917]
Co-training Random Forest Accuracy for 200 labeled: 0.8870614035087719
Supervised Random Forest Accuracy for 200 labeled: 0.8929093567251462



Co-training CNN Accuracy for 400 labeled: [1.3612128496170044, 0.8735380172729492]
Co-training Random Forest Accuracy for 400 labeled: 0.8793859649122807
Supervised Random Forest Accuracy for 400 labeled: 0.8881578947368421



Co-training CNN Accuracy for 600 labeled: [1.242202877998352, 0.8051900863647461]
Co-training Random Forest Accuracy for 600 labeled: 0.8914473684210527
Supervised Random Forest Accuracy for 600 labeled: 0.8914473684210527
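
A quick way to see how accuracy changes with the labeled-set size is to tabulate the printed values and check whether each series actually increases. The sketch below does this with the numbers copied from the output above; the CNN entries use only the accuracy part of each `[loss, accuracy]` pair. The helper name `is_strictly_increasing` is hypothetical.

```python
# Accuracies copied from the output above, keyed by labeled-set size.
results = {
    "Co-training CNN": {200: 0.8399122953414917, 400: 0.8735380172729492, 600: 0.8051900863647461},
    "Co-training Random Forest": {200: 0.8870614035087719, 400: 0.8793859649122807, 600: 0.8914473684210527},
    "Supervised Random Forest": {200: 0.8929093567251462, 400: 0.8881578947368421, 600: 0.8914473684210527},
}


def is_strictly_increasing(values):
    """Return True if every value is larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))


trends = {}
for name, by_size in results.items():
    accuracies = [by_size[size] for size in sorted(by_size)]
    trends[name] = is_strictly_increasing(accuracies)
    formatted = ", ".join(f"{acc:.3f}" for acc in accuracies)
    label = "increasing" if trends[name] else "not monotonic"
    print(f"{name}: {formatted} ({label})")
```

Since these are single runs, repeating each scenario with several random seeds would give a more reliable picture than one value per size.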




3-6. Evaluate the performance for different unlabeled data sizes.
- Set the labeled data size to a value in the range of 100 to 130.
- Set the unlabeled data size to 200, 400, and 600.
- Run the algorithms and report the accuracy of both approaches: co-training and label propagation.


In [16]:
# your code here
# Please write your code here. Comments are provided for guidance purposes. Make adjustments as needed.

labeled_set = {}
unlabeled_set = {}
test_set = {}

def modified_split_dataset(dataset_view1, dataset_view2, unlabeled_size, labeled_size=120, random_seed=42):
    """Split both views into labeled, unlabeled, and test subsets.

    The first `labeled_size` shuffled samples are labeled, the next `unlabeled_size`
    are unlabeled, and all remaining samples form the test set.
    Labels are read from the global `labels_data` array.
    """
    np.random.seed(random_seed)
    n_samples = len(dataset_view1)
    indices = np.random.permutation(n_samples)

    n_labeled = int(labeled_size)
    n_unlabeled = int(unlabeled_size)

    # Consecutive, non-overlapping slices of the shuffled indices
    end_labeled = n_labeled
    end_unlabeled = end_labeled + n_unlabeled

    labeled_indices = indices[:end_labeled]
    unlabeled_indices = indices[end_labeled:end_unlabeled]
    test_indices = indices[end_unlabeled:]

    datasets = {
        "view1": dataset_view1,
        "view2": dataset_view2,
        "labels": labels_data
    }

    def extract(subset_indices):
        return {key: value[subset_indices] for key, value in datasets.items()}

    labeled_set = extract(labeled_indices)
    unlabeled_set = extract(unlabeled_indices)
    test_set = extract(test_indices)

    return (
        labeled_set["view1"], labeled_set["view2"], labeled_set["labels"],
        unlabeled_set,
        test_set, test_set["labels"]
    )
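
The index arithmetic in the split can be checked in isolation. The sketch below partitions a shuffled index range into labeled, unlabeled, and test slices and verifies that they are disjoint and cover every sample. The function `partition_indices` and the sample count of 1000 are hypothetical, chosen only for illustration.

```python
import numpy as np


def partition_indices(n_samples, labeled_size, unlabeled_size, seed=42):
    """Return disjoint labeled, unlabeled, and test index arrays."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    end_labeled = labeled_size
    end_unlabeled = labeled_size + unlabeled_size
    return (indices[:end_labeled],
            indices[end_labeled:end_unlabeled],
            indices[end_unlabeled:])


n_samples = 1000  # hypothetical dataset size
for unlabeled_size in (200, 400, 600):
    labeled, unlabeled, test = partition_indices(n_samples, 120, unlabeled_size)
    # The three subsets never share a sample and together cover the whole dataset.
    covered = set(labeled) | set(unlabeled) | set(test)
    assert len(covered) == n_samples
    print(f"unlabeled={len(unlabeled)}: labeled={len(labeled)}, test={len(test)}")
```

With this kind of split, the test set shrinks as the unlabeled set grows, so accuracies for different unlabeled sizes are measured on slightly different test sets.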



In [17]:
# Scenario 1: 200 unlabeled samples
labeled_set["view1"], labeled_set["view2"], labeled_set["labels"], unlabeled_set, test_set, test_set["labels"] = modified_split_dataset(view1_data, view2_data, unlabeled_size=200)

cnn_acc_5, rf_acc_5 = co_training(cnn_classifier, rf_classifier,
                                  labeled_set,
                                  unlabeled_set,
                                  test_set,
                                  threshold_confidence=0.90)

label_prop_accuracy5 = label_propagation(labeled_set["view1"], unlabeled_set["view1"], labeled_set["labels"], test_set["view1"], test_set["labels"], n_neighbors=7)


# Scenario 2: 400 unlabeled samples
labeled_set["view1"], labeled_set["view2"], labeled_set["labels"], unlabeled_set, test_set, test_set["labels"] = modified_split_dataset(view1_data, view2_data, unlabeled_size=400)

cnn_acc_6, rf_acc_6 = co_training(cnn_classifier, rf_classifier,
                                  labeled_set,
                                  unlabeled_set,
                                  test_set,
                                  threshold_confidence=0.90)

label_prop_accuracy6 = label_propagation(labeled_set["view1"], unlabeled_set["view1"], labeled_set["labels"], test_set["view1"], test_set["labels"], n_neighbors=7)


# Scenario 3: 600 unlabeled samples
labeled_set["view1"], labeled_set["view2"], labeled_set["labels"], unlabeled_set, test_set, test_set["labels"] = modified_split_dataset(view1_data, view2_data, unlabeled_size=600)

cnn_acc_7, rf_acc_7 = co_training(cnn_classifier, rf_classifier,
                                  labeled_set,
                                  unlabeled_set,
                                  test_set,
                                  threshold_confidence=0.90)

label_prop_accuracy7 = label_propagation(labeled_set["view1"], unlabeled_set["view1"], labeled_set["labels"], test_set["view1"], test_set["labels"], n_neighbors=7)




In [18]:
print(f"Co-training with 200 unlabeled samples: CNN Accuracy: {cnn_acc_5}, Random Forest Accuracy: {rf_acc_5}")
print(f"KNN Accuracy for label propagation with 200 unlabeled samples: {label_prop_accuracy5}")
print(f"Co-training with 400 unlabeled samples: CNN Accuracy: {cnn_acc_6}, Random Forest Accuracy: {rf_acc_6}")
print(f"KNN Accuracy for label propagation with 400 unlabeled samples: {label_prop_accuracy6}")
print(f"Co-training with 600 unlabeled samples: CNN Accuracy: {cnn_acc_7}, Random Forest Accuracy: {rf_acc_7}")
print(f"KNN Accuracy for label propagation with 600 unlabeled samples: {label_prop_accuracy7}")

Co-training with 200 unlabeled samples: CNN Accuracy: [1.6588548421859741, 0.7049314975738525], Random Forest Accuracy: 0.8861034198907431
KNN Accuracy for label propagation with 200 unlabeled samples: 0.8088004190675746
Co-training with 400 unlabeled samples: CNN Accuracy: [1.6425343751907349, 0.7054622769355774], Random Forest Accuracy: 0.8865000379852617
KNN Accuracy for label propagation with 400 unlabeled samples: 0.809086074603054
Co-training with 600 unlabeled samples: CNN Accuracy: [1.5885318517684937, 0.6072668433189392], Random Forest Accuracy: 0.8862146108153977
KNN Accuracy for label propagation with 600 unlabeled samples: 0.8086862608963974
