<h1 style='text-aling:center;color:Navy'>  Big Data Science - Fall 2023  </h1>
<h1 style='text-aling:center;color:Navy'>  Assignment 1  </h1>

***

<b>Submission Deadline: This assignment is due Friday, November 3 at 8:59 P.M.</b>

A few notes before you start:
- You are not allowed to use built-in libraries for co-training and label propagation itself.
- Directly sharing answers is not okay, but discussing problems with other students is encouraged.
- You should start early so that you have time to get help if you're stuck.

- Complete all the exercises below and turn in a write-up in the form of a Jupyuter notebook, that is, an .ipynb file. The write-up should include your code and answers to exercise questions. You will submit your assignment online as an attachment (*.ipynb), through Canvas under Assignment 1.

# <span style="color:#3665af">Semi-Supervised Learning </span>
<hr>

###### Goal
In this assignment, we will explore the concepts and techniques of semi-supervised learning.

###### Prerequisites
This assignment has the following dependencies:
- Jupyter Notebook, along with the following libraries (which should be installed on the Computing Platform):
  - Scikit Learn
  - Numpy
  - os

Let's dive into the world of semi-supervised learning!

<div style="font-size:30px;color:#3665af;background-color:#E9E9F5;padding:10px;">Assignment Hands-on 

<div style="font-size:20px;color:#F1F8FC;background-color:#0095EA;padding:10px;"> Import libraries </div>

In [1]:
# importing libraries
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import os
from sklearn.ensemble import RandomForestClassifier
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Dense,
                                     Flatten,
                                     Dropout,
                                     BatchNormalization,
                                     Conv2D,
                                     MaxPooling2D,)
from tensorflow.keras.regularizers import l2


<div style="font-size:20px;color:#F1F8FC;background-color:#0095EA;padding:10px;"> Learn more about the data </div>


In this assignment, we are working with data collected from two different view from same region via satellite technology to study the Arctic region. These two data types offer valuable insights into various sea ice types, thereby enhancing navigation in the Arctic.

1. **Sentinel-1 Data (view 1):** We use Synthetic Aperture Radar (SAR) satellite images from the Sentinel-1 mission. SAR images are incredibly useful for creating sea ice charts in the Arctic. SAR works by sending radar signals to the Earth's surface and capturing the signals that bounce back. One specific view we're utilizing is the Sentinel-1 image captured in HH polarization. This view helps us understand the characteristics of the sea ice in the Arctic. if you want to know more about it here is the [link](https://en.wikipedia.org/wiki/Sentinel-1).

2. **AMSR2 Data (view 2):** Alongside each Sentinel-1 image, we have corresponding data from the Advanced Microwave Scanning Radiometer 2 (AMSR2). This dataset contains information about the brightness temperatures of the Earth's surface. AMSR2 measures microwave radiation, which can be used to gather information about surface properties like sea ice concentration. if you want to know more about it here is the [link](https://www.ospo.noaa.gov/Products/atmosphere/gpds/about_amsr2.html).

To further analyze these datasets, we selected 10 files and divided the images into smaller patches, each patch 32 by 32 pixels. This patches allows us to focus on specific areas of interest within the Arctic and study them in detail. By combining the information from both Sentinel-1 and AMSR2 data, we can gain a comprehensive understanding of the Arctic environment and its sea ice patterns, which is crucial for various scientific and practical applications, including safe navigation in this challenging region.

view 1: Sentinel-1 image

<img alt="nersc_sar_primary view" src="nersc_sar_primary.jpg"/>

view 2: AMSR2 image

<img src="btemp_89_0h.jpg" alt = "btemp_89_0h view" >

Download the data.zip file from Canvas, and then execute the cell below to import the data. You can customize the directory name for the data if necessary.

In [2]:
def load_data_from_directories(view1_dir, view2_dir, labels_dir):
    """
    Load data from directories containing two views and corresponding labels.

    Parameters:
    - view1_dir (str): Path to the directory containing view 1 data files.
    - view2_dir (str): Path to the directory containing view 2 data files.
    - labels_dir (str): Path to the directory containing label data files.

    Returns:
    - view1_data (numpy.ndarray): NumPy array containing data from view 1.
    - view2_data (numpy.ndarray): NumPy array containing data from view 2.
    - labels_data (numpy.ndarray): NumPy array containing label data.

    This function loads data from two views and their corresponding labels, assuming a common "number" part
    in the file names for matching files. It ensures that data files from both views and labels are consistent
    and loads them into NumPy arrays for further processing.
    """
    # List all files in each directory
    files_view1 = os.listdir(view1_dir)
    files_view2 = os.listdir(view2_dir)
    files_label = os.listdir(labels_dir)

    # Initialize empty lists to store data from each view and labels
    view1_data = []
    view2_data = []
    labels_data = []

    # Iterate through the files in the directory
    for filename in files_view1:
        if filename.endswith('_samples_view1.npy'):
            # Extract the common "number" part of the file name
            common_number = filename.split('_')[0]

            # Check if corresponding files exist for view2 and labels
            if common_number + '_samples_view2.npy' in files_view2 and common_number + '_labels.npy' in files_label:
                # Load data from the NumPy files
                data_view1 = np.load(os.path.join(view1_dir, filename))
                data_view2 = np.load(os.path.join(view2_dir, common_number + '_samples_view2.npy'))
                data_labels = np.load(os.path.join(labels_dir, common_number + '_labels.npy'))

                # Append data to respective lists
                view1_data.append(data_view1)
                view2_data.append(data_view2)
                labels_data.append(data_labels)

    view1_data = np.array(view1_data)
    view2_data = np.array(view2_data)
    labels_data = np.array(labels_data)

    return view1_data, view2_data, labels_data


view1_dir = 'C:/Users/srina/Documents/Big Data science/data/view1/' 
view2_dir = 'C:/Users/srina/Documents/Big Data science/data/view2/' 
labels_dir = 'C:/Users/srina/Documents/Big Data science/data/labels/' 

view1_data, view2_data, labels_data = load_data_from_directories(view1_dir, view2_dir, labels_dir)


print(" shape view 1 data: ", view1_data.shape)
print(" shape view 2 data: ", view2_data.shape)
print(" shape labels data: ", labels_data.shape)


 shape view 1 data:  (13683, 32, 32)
 shape view 2 data:  (13683, 32, 32)
 shape labels data:  (13683, 1)


<div style="font-size:20px;color:#F1F8FC;background-color:#0095EA;padding:10px;">Part  1. Co-Training Models for Sea Ice Classification</div>

In this task, you'll be applying the Co-training technique to the dataset. Through this, you'll observe the outcomes of both semi-supervised learning and supervised learning when there's only a limited amount of labeled data available. You may want to revisit Lecture 02, which covers the topic of cotraining for a more comprehensive understanding

1-1. Divide the dataset into three distinct sets: one for labeled data, one for unlabeled data, and one for test data. Make sure that the labeled dataset contains between 100 and 130 data.

In [3]:
# Please write your code here. Comments are provided for guidance purposes. make adjustments as needed.
def split_dataset(dataset_view1, dataset_view2, labeled_size, test_size=0.2, random_seed=42):
    """
    Split the dataset into labeled, unlabeled, and test sets.

    Parameters:
    - dataset (list or array-like): The input dataset to be split.
    - labeled_size (int): The target size for the labeled set (default: 130).
    - test_size (float): The proportion of the dataset to include in the test split (default: 0.2).
    - random_seed (int): Seed for reproducibility (default: None).

    Returns:
    - labeled_set_view1: Subset of the dataset with labeled data (approximately 100-130 points).
    - labeled_set_view2: Subset of the dataset with labeled data (approximately 100-130 points).
    - label_labeled_set: Labels corresponding to the labeled data points.
    - unlabeled_set: Subset of the dataset with unlabeled data.
    - test_set: Subset of the dataset for testing.
    - label_test_set: Labels corresponding to the test data points.
    """
    # First, let's create a common index for shuffling and splitting the data
    num_samples = len(dataset_view1)
    indices = np.arange(num_samples)

    # Shuffle the indices for random sampling
    np.random.seed(random_seed)
    np.random.shuffle(indices)

    # Determine the number of labeled data points

    # Calculate the number of labeled and test data points
    n_labeled = int(labeled_size)
    n_test = int(test_size * num_samples)

    # Split the data into labeled, unlabeled, and test sets
    labeled_indices = indices[:n_labeled]
    test_indices = indices[n_labeled:n_labeled + n_test]
    unlabeled_indices = indices[n_labeled + n_test:]

    labeled_set_view1 = dataset_view1[labeled_indices]
    labeled_set_view2 = dataset_view2[labeled_indices]
    label_labeled_set = labels_data[labeled_indices]  # You need to provide the labels

    unlabeled_set_view1 = dataset_view1[unlabeled_indices]
    unlabeled_set_view2 = dataset_view2[unlabeled_indices]

    test_set_view1 = dataset_view1[test_indices]
    test_set_view2 = dataset_view2[test_indices]
    label_test_set = labels_data[test_indices]  # You need to provide the labels

    return labeled_set_view1, labeled_set_view2, label_labeled_set, unlabeled_set_view1, unlabeled_set_view2, test_set_view1, test_set_view2, label_test_set

1-2. initialize two classifiers for each view using scikit-learn. Consider using a Convolutional Neural Network (CNN) as one of the classifiers and a Random Forest as the other.
Here's a short description of the configuration for the CNN (Convolutional Neural Network) and Random Forest (RF) classifiers to implement:

**CNN Classifier Configuration:**

1. Input Layer: BatchNormalization with input shape (32, 32, 1).
2. Convolutional Layer 1: 32 filters, each with a 3x3 kernel and ReLU activation.
3. Max Pooling Layer 1: 2x2 pooling with a stride of 2.
4. Convolutional Layer 2: 32 filters, each with a 3x3 kernel and ReLU activation.
5. Max Pooling Layer 2: 2x2 pooling with a stride of 2.
6. Convolutional Layer 3: 32 filters, each with a 3x3 kernel and ReLU activation.
7. Max Pooling Layer 3: 2x2 pooling with a stride of 2.
8. BatchNormalization Layer.
9. Flatten Layer.
10. Dropout Layer with a dropout rate of 0.1.
11. Fully Connected Layer 1: 16 neurons, ReLU activation, and L2 regularization with a weight decay of 0.001.
12. Dropout Layer with a dropout rate of 0.1.
13. Fully Connected Layer 2: 16 neurons, ReLU activation, and L2 regularization with a weight decay of 0.001.
14. Dropout Layer with a dropout rate of 0.1.
15. Output Layer: Dense layer with the number of neurons equal to the number of classes and softmax activation.

**CNN Model Compilation:**
- Optimizer: Adam
- Loss Function: Sparse Categorical Crossentropy
- Metrics for Evaluation: Accuracy

**Random Forest Classifier Configuration:**
- Number of Estimators: 20
- Random State: 42




In [4]:
num_classes = 6
# write your code here
#implement classifiers based on the provided definition
#cnn_classifier =
#rf_classifier =

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, Dropout, BatchNormalization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import l2
import numpy as np

# Define the configuration for the CNN classifier
cnn_classifier = Sequential([
    Input(shape=(32, 32, 1)),
    BatchNormalization(),
    Conv2D(32, (3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2), strides=2),
    Conv2D(32, (3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2), strides=2),
    Conv2D(32, (3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2), strides=2),
    BatchNormalization(),
    Flatten(),
    Dropout(0.1),
    Dense(16, activation='relu', kernel_regularizer=l2(0.001)),
    Dropout(0.1),
    Dense(16, activation='relu', kernel_regularizer=l2(0.001)),
    Dropout(0.1),
    Dense(num_classes, activation='softmax')
])

cnn_classifier.compile(optimizer='adam',
                       loss='sparse_categorical_crossentropy',
                       metrics=['accuracy'])


# Define the configuration for the Random Forest classifier
def rf_classifier(num_estimators=20, random_state=42):
    return RandomForestClassifier(n_estimators=num_estimators, random_state=random_state)



1-3. Do the co-training part:
   - Train classifiers on the labeled data
   - Predict on the unlabeled data and identify instances that have a confidence score more than 90.
   - Add the confident instances to the labeled set and train again
   - Compute the accuracy of the classifiers on test set
To provide a more understanding of the accuracy measure, please refer to the following link: [link](https://en.wikipedia.org/wiki/Accuracy_and_precision).
<img src="accuracy.jpg" alt = "accuracy metric" >

In [22]:
 # Please write your code here. Comments are provided for guidance purposes. make adjustments as needed.
def co_training(classifier1, classifier2, labeled_set_1, labeled_set_2, label_labeled_set, unlabeled_set_1, unlabeled_set_2, test_set_1, test_set_2 , label_test_set, threshold_confidence):
    """
    Perform co-training with two classifiers on labeled and unlabeled data.

    Parameters:
    - classifier1: The first classifier (e.g., CNN).
    - classifier2: The second classifier (e.g., Random Forest).
    - labeled_set (list or array-like): Labeled dataset.
    - unlabeled_set (list or array-like): Unlabeled dataset.
    - test_set (list or array-like): Test dataset.
    - threshold_confidence (float): The minimum confidence threshold for adding unlabeled samples to the training set.

    Returns:
    - classifier1_accuracy (float): Accuracy of Classifier 1 on the test set after co-training.
    - classifier2_accuracy (float): Accuracy of Classifier 2 on the test set after co-training.
    """
    
    samples_1, x1, y1 = labeled_set_2.shape
    labeled_set_2_2D = labeled_set_2.reshape((samples_1,x1*y1))
    
    samples_2, x2, y2 = unlabeled_set_2.shape
    unlabeled_set_2_2D = unlabeled_set_2.reshape((samples_2,x2*y2))
    
    samples_3, x3, y3 = test_set_2.shape
    test_set_2_2D = test_set_2.reshape((samples_3,x3*y3))
    
    classifier1.fit(labeled_set_1, label_labeled_set)
    classifier2.fit(labeled_set_2_2D, label_labeled_set)

    # Loop for co-training iterations
   
    # Predict on unlabeled data
    predictions1 = classifier1.predict(unlabeled_set_1)
    predictions2 = classifier2.predict(unlabeled_set_2_2D)

    # Confidence scores for the predictions
    confidences1 = np.max(classifier1.predict(unlabeled_set_1), axis=1)
    confidences2 = np.max(classifier2.predict_proba(unlabeled_set_2_2D), axis=1)

    # Identify instances with high confidence for both classifiers
    confident_samples_1 = np.where(confidences1 > threshold_confidence)[0]
    confident_samples_2 = np.where(confidences2 > threshold_confidence)[0]

    print(confident_samples_1.shape)
    print(confident_samples_1)
    # Add confident samples to the labeled set and remove them from the unlabeled set
    labeled_set_1 = np.vstack((labeled_set_1, unlabeled_set_1[confident_samples_1]))
    label_labeled_set_1 = np.append(label_labeled_set, predictions1[confident_samples_1])
        
    labeled_set_2_2D = np.vstack((labeled_set_2_2D, unlabeled_set_2_2D[confident_samples_2]))
    label_labeled_set_2 = np.append(label_labeled_set, predictions2[confident_samples_2])

    # Remove confident samples from the unlabeled set
    unlabeled_set_1 = np.delete(unlabeled_set_1, confident_samples_1, axis=0)
    unlabeled_set_2_2D = np.delete(unlabeled_set_2_2D, confident_samples_2, axis=0)

    # Retrain the classifiers with the updated labeled set
    classifier1.fit(labeled_set_1, label_labeled_set_1)
    classifier2.fit(labeled_set_2_2D, label_labeled_set_2)

    # Compute accuracy on the test set
    classifier1_accuracy = classifier1.evaluate(test_set_1, label_test_set)
    classifier2_accuracy = classifier2.score(test_set_2_2D, label_test_set)

    return classifier1_accuracy, classifier2_accuracy

1-4. pick one of the classifiers and do the supervised training with the labeled data and calculate the accuracy

In [23]:
# Please write your code here. Comments are provided for guidance purposes. make adjustments as needed.
def supervised_training_and_accuracy(classifier, labeled_data, labeled_labels, test_data, test_labels):
    """
    Perform supervised training with a classifier on the labeled data and calculate the accuracy on test data.

    Parameters:
    - classifier: The classifier to be used for supervised training (e.g., Random Forest).
    - labeled_data (array-like): Labeled training data.
    - labeled_labels (array-like): Labels for the labeled training data.
    - test_data (array-like): Test data for evaluation.
    - test_labels (array-like): Labels for the test data.

    Returns:
    - accuracy (float): Accuracy of the classifier on the test data after supervised training.
    """
    samples_1, x1, y1 = labeled_data.shape
    labeled_set_2_2D = labeled_data.reshape((samples_1,x1*y1))
    
    samples_3, x3, y3 = test_data.shape
    test_set_2_2D = test_data.reshape((samples_3,x3*y3))
    
    classifier.fit(labeled_set_2_2D, labeled_labels)
    accuracy = classifier.score(test_set_2_2D, test_labels)
    return accuracy

1-5. Compare the Co-training approach accuracy and supervised model with limited labeled data and write your reason about it.

In [24]:
# Define the number of classes in your dataset (update this accordingly)
num_classes = 6

# Split the dataset into labeled, unlabeled, and test sets
labeled_set_view1, labeled_set_view2, label_labeled_set, unlabeled_set_view1, unlabeled_set_view2, test_set_view1, test_set_view2, label_test_set = split_dataset(view1_data, view2_data, labeled_size= 120)


classifier_rf = rf_classifier(num_estimators=20, random_state=42)

# Co-training
cnn_accuracy, rf_accuracy = co_training(classifier1=cnn_classifier , classifier2=classifier_rf, 
                                        labeled_set_1=labeled_set_view1, labeled_set_2=labeled_set_view2,label_labeled_set=label_labeled_set,
                                        unlabeled_set_1=unlabeled_set_view1, unlabeled_set_2=unlabeled_set_view2,
                                        test_set_1=test_set_view1, test_set_2=test_set_view2, label_test_set=label_test_set,
                                        threshold_confidence= 0.90)

# Supervised training with one of the classifiers (e.g., Random Forest)
rf_supervised_accuracy = supervised_training_and_accuracy(classifier_rf, labeled_set_view2, label_labeled_set, test_set_view2, label_test_set)

# Compare accuracies
print(f"Co-training CNN Accuracy: {cnn_accuracy}")
print(f"Co-training Random Forest Accuracy: {rf_accuracy}")
print(f"Supervised Random Forest Accuracy: {rf_supervised_accuracy}")



  classifier2.fit(labeled_set_2_2D, label_labeled_set)


(0,)
Co-training CNN Accuracy: [1.568174123764038, 0.8146929740905762]
Co-training Random Forest Accuracy: 0.893640350877193
Supervised Random Forest Accuracy: 0.8636695906432749


  classifier.fit(labeled_set_2_2D, labeled_labels)


<div style="font-size:20px;color:#F1F8FC;background-color:#0095EA;padding:10px;">Part 2. Label Propagation for Sea Ice Classification</div>


In this task, you'll be applying the label propagation technique to the dataset. Through this, you'll observe the outcomes of both semi-supervised learning and supervised learning when there's only a limited amount of labeled data available.


<img src="label_propagation.jpg" alt = "label propgation process" >

 2-1. Apply the K-Nearest Neighbors (KNN) algorithm with a parameter configuration where n_neighbors is set to 7 for the label propagation model. Utilize one of the labeled data views and the corresponding unlabeled data from part 1 as input.
To provide a more understanding of the accuracy measure, please refer to the following link: [link](https://en.wikipedia.org/wiki/Accuracy_and_precision).

In [8]:
from sklearn.semi_supervised import LabelPropagation
from sklearn.neighbors import KNeighborsClassifier

# Please write your code here. Comments are provided for guidance purposes. make adjustments as needed.
def label_propagation(labeled_data, unlabeled_data, labeled_labels, test_data, label_test, n_neighbors):
    """
    Apply K-Nearest Neighbors (KNN) to the label propagation model on one data view and test data.

    Parameters:
    - labeled_data (array-like): Labeled data points.
    - unlabeled_data (array-like): Unlabeled data points.
    - labeled_labels (array-like): Labels corresponding to the labeled data points.
    - test_data (array-like): Test data to evaluate label propagation performance.
    - label_test (array-like): Labels corresponding to the test data points.
    - n_neighbors (int): Number of neighbors to consider in KNN (default: 7).

    Returns:
    - accuracy (float): Accuracy of label propagation on the test data.
    """
    
    samples_1, x1, y1 = labeled_data.shape
    labeled_set_1_2D = labeled_data.reshape((samples_1,x1*y1))
    
    samples_2, x2, y2 = unlabeled_data.shape
    unlabeled_set_1_2D = unlabeled_data.reshape((samples_2,x2*y2))
    
    samples_3, x3, y3 = test_data.shape
    test_set_1_2D = test_data.reshape((samples_3,x3*y3))
    
    label_prop_model = LabelPropagation(kernel="knn", n_neighbors=n_neighbors, max_iter=1000)
    
    # Fit the model
    label_prop_model.fit(labeled_set_1_2D, labeled_labels)
    
    # Predict labels for the test data
    predicted_labels = label_prop_model.predict(unlabeled_set_1_2D)
    
    data_combined = np.vstack([labeled_set_1_2D, unlabeled_set_1_2D])
    
    all_labels = np.concatenate([labeled_labels.flatten(), predicted_labels])
    
    knn_classifier = KNeighborsClassifier(n_neighbors=n_neighbors)

    # Fit the KNN classifier on the labeled data
    knn_classifier.fit(data_combined, all_labels)

    # Predict labels for the test data
    test_predicted_labels = knn_classifier.predict(test_set_1_2D)
    
    # Calculate accuracy
    accuracy = accuracy_score(label_test, test_predicted_labels)
    
    return accuracy

label_prop_accuracy = label_propagation(labeled_set_view1, unlabeled_set_view1, label_labeled_set, test_set_view1, label_test_set, n_neighbors=7)

print(f"KNN Accuracy for label propogation: {label_prop_accuracy}")

  y = column_or_1d(y, warn=True)


KNN Accuracy for label propogation: 0.8088450292397661


2-2. Select a classification algorithm and perform supervised learning on the labeled set. Then, evaluate the model's performance by calculating the accuracy. You can use a built-in library for the classifier. Compare your sepervised and semi supervised accuracy.

In [9]:
# Please write your code here. Comments are provided for guidance purposes. make adjustments as needed.
def supervised_training_and_accuracy(classifier, labeled_data, labeled_labels, test_data, test_labels):
    """
    Perform supervised training with a classifier on the labeled data and calculate the accuracy on test data.

    Parameters:
    - classifier: The classifier to be used for supervised training (e.g., Random Forest).
    - labeled_data (array-like): Labeled training data.
    - labeled_labels (array-like): Labels for the labeled training data.
    - test_data (array-like): Test data for evaluation.
    - test_labels (array-like): Labels for the test data.

    Returns:
    - accuracy (float): Accuracy of the classifier on the test data after supervised training.
    """
    
    samples_1, x1, y1 = labeled_data.shape
    labeled_set_1_2D = labeled_data.reshape((samples_1,x1*y1))
    
    samples_2, x2, y2 = test_data.shape
    test_data_1_2D = test_data.reshape((samples_2,x2*y2))
    
    classifier.fit(labeled_set_1_2D, labeled_labels)
    
    # Predict on test data
    predictions = classifier.predict(test_data_1_2D)
    
    # Calculate accuracy
    accuracy = accuracy_score(test_labels, predictions)
    
    return accuracy

rnd_f_accuracy = supervised_training_and_accuracy(classifier_rf, labeled_set_view1, label_labeled_set, test_set_view1, label_test_set)

print(f"Supervised training accuracy: {rnd_f_accuracy}")
    

Supervised training accuracy: 0.8636695906432749


  classifier.fit(labeled_set_1_2D, labeled_labels)


<div style="font-size:20px;color:#F1F8FC;background-color:#0095EA;padding:10px;">Part 3. Now let's perform some experimentation and make some observations!</div>



3-1. We will explore the impact of varying the threshold confidence in the co-training process at three different values: 80, 70, and 60. We will then assess the accuracy of co-training based on these threshold settings.
To provide a more understanding of the accuracy measure, please refer to the following link: [link](https://en.wikipedia.org/wiki/Accuracy_and_precision).

In [10]:
# your code here

cnn_accuracy_1, rf_accuracy_1 = co_training(classifier1=cnn_classifier , classifier2=classifier_rf, 
                                        labeled_set_1=labeled_set_view1, labeled_set_2=labeled_set_view2,label_labeled_set=label_labeled_set,
                                        unlabeled_set_1=unlabeled_set_view1, unlabeled_set_2=unlabeled_set_view2,
                                        test_set_1=test_set_view1, test_set_2=test_set_view2, label_test_set=label_test_set,
                                        threshold_confidence= 0.80)

cnn_accuracy_2, rf_accuracy_2 = co_training(classifier1=cnn_classifier , classifier2=classifier_rf, 
                                        labeled_set_1=labeled_set_view1, labeled_set_2=labeled_set_view2,label_labeled_set=label_labeled_set,
                                        unlabeled_set_1=unlabeled_set_view1, unlabeled_set_2=unlabeled_set_view2,
                                        test_set_1=test_set_view1, test_set_2=test_set_view2, label_test_set=label_test_set,
                                        threshold_confidence= 0.70)

cnn_accuracy_3, rf_accuracy_3 = co_training(classifier1=cnn_classifier , classifier2=classifier_rf, 
                                        labeled_set_1=labeled_set_view1, labeled_set_2=labeled_set_view2,label_labeled_set=label_labeled_set,
                                        unlabeled_set_1=unlabeled_set_view1, unlabeled_set_2=unlabeled_set_view2,
                                        test_set_1=test_set_view1, test_set_2=test_set_view2, label_test_set=label_test_set,
                                        threshold_confidence= 0.60)

print(f"Co-training of CNN Accuracy: {cnn_accuracy_1} and Random Forest Accuracy: {rf_accuracy_1}")
print(f"Co-training of CNN Accuracy: {cnn_accuracy_2} and Random Forest Accuracy: {rf_accuracy_2}")
print(f"Co-training of CNN Accuracy: {cnn_accuracy_3} and Random Forest Accuracy: {rf_accuracy_3}")





  classifier2.fit(labeled_set_2_2D, label_labeled_set)




  classifier2.fit(labeled_set_2_2D, label_labeled_set)




  classifier2.fit(labeled_set_2_2D, label_labeled_set)


Co-training of CNN Accuracy: [1.8210883140563965, 0.09393274784088135] and Random Forest Accuracy: 0.893640350877193
Co-training of CNN Accuracy: [1.8118579387664795, 0.3804824650287628] and Random Forest Accuracy: 0.893640350877193
Co-training of CNN Accuracy: [1.7965501546859741, 0.7335526347160339] and Random Forest Accuracy: 0.893640350877193


3-2. Change the parameters of the K-Nearest Neighbors (KNN) algorithm for Label Propagation (part 2) with the values 3, 5, and 10, and explain what you understand about these parameter adjustments.


In [11]:
# your code here
label_prop_accuracy_1 = label_propagation(labeled_set_view1, unlabeled_set_view1, label_labeled_set, test_set_view1, label_test_set, n_neighbors=3)
label_prop_accuracy_2 = label_propagation(labeled_set_view1, unlabeled_set_view1, label_labeled_set, test_set_view1, label_test_set, n_neighbors=5)
label_prop_accuracy_3 = label_propagation(labeled_set_view1, unlabeled_set_view1, label_labeled_set, test_set_view1, label_test_set, n_neighbors=10)


print(f"KNN Accuracy for label propogation for 3: {label_prop_accuracy_1}")
print(f"KNN Accuracy for label propogation for 5: {label_prop_accuracy_2}")
print(f"KNN Accuracy for label propogation for 10: {label_prop_accuracy_3}")

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


KNN Accuracy for label propogation for 3: 0.8212719298245614
KNN Accuracy for label propogation for 5: 0.8212719298245614
KNN Accuracy for label propogation for 10: 0.8088450292397661




3-3. Let's see the impact of of a simplifies models for cotraining approach.
- Reduce the number of convolutional layers in the question 1-2 from 3 to 1 convolution layer and the rest of the layers is the same
- Change the number of trees for the random forest algorithm to 1.
Evaluate the performance of cotraining approach.
- Additionally, use the 1 layer convolution layer as the supervised model and evaluate the performance for supervised learning.


In [12]:
# your code here

cnn_classifier_1 = Sequential([
    Input(shape=(32, 32, 1)),
    BatchNormalization(),
    Conv2D(32, (3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2), strides=2),
    BatchNormalization(),
    Flatten(),
    Dropout(0.1),
    Dense(16, activation='relu', kernel_regularizer=l2(0.001)),
    Dropout(0.1),
    Dense(16, activation='relu', kernel_regularizer=l2(0.001)),
    Dropout(0.1),
    Dense(num_classes, activation='softmax')
])

cnn_classifier_1.compile(optimizer='adam',
                       loss='sparse_categorical_crossentropy',
                       metrics=['accuracy'])

classifier_rf_1 = rf_classifier(num_estimators=1, random_state=42)


cnn_acc, rf_acc = co_training(classifier1=cnn_classifier_1 , classifier2=classifier_rf_1, 
                                        labeled_set_1=labeled_set_view1, labeled_set_2=labeled_set_view2,label_labeled_set=label_labeled_set,
                                        unlabeled_set_1=unlabeled_set_view1, unlabeled_set_2=unlabeled_set_view2,
                                        test_set_1=test_set_view1, test_set_2=test_set_view2, label_test_set=label_test_set,
                                        threshold_confidence= 0.90)

print(f"Co-training of CNN Accuracy: {cnn_acc} and Random Forest Accuracy: {rf_acc}")

def supervised_training_and_accuracy(classifier, labeled_data, labeled_labels, test_data, test_labels):
    classifier.fit(labeled_data, labeled_labels)
    accuracy = classifier.evaluate(test_data, test_labels)
    return accuracy

supervised_acc = supervised_training_and_accuracy(cnn_classifier_1, labeled_set_view1, label_labeled_set, test_set_view1, label_test_set)

print(f"Supervised 1 layer CNN Accuracy: {supervised_acc}")





  classifier2.fit(labeled_set_2_2D, label_labeled_set)


Co-training of CNN Accuracy: [1.8911731243133545, 0.18932747840881348] and Random Forest Accuracy: 0.8154239766081871
Supervised 1 layer CNN Accuracy: [1.8017785549163818, 0.1940789520740509]


3-4. Let's adjust the amount of labeled data in part 2 by considering two different quantities: 200 and 400 labeled data points. In each scenario, the remaining data will remain unlabeled. Evaluate the performance of label propagation under these labeled data scenarios.


In [13]:
# your code here
labeled_set_view3, labeled_set_view4, label_labeled_set2, unlabeled_set_view3, unlabeled_set_view4, test_set_view3, test_set_view4, label_test_set2 = split_dataset(view1_data, view2_data, labeled_size=200)

labeled_set_view5, labeled_set_view6, label_labeled_set3, unlabeled_set_view5, unlabeled_set_view6, test_set_view5, test_set_view6, label_test_set3 = split_dataset(view1_data, view2_data, labeled_size=400)

label_prop_accuracy2 = label_propagation(labeled_set_view3, unlabeled_set_view3, label_labeled_set2, test_set_view3, label_test_set2, n_neighbors=7)

label_prop_accuracy3 = label_propagation(labeled_set_view5, unlabeled_set_view5, label_labeled_set3, test_set_view5, label_test_set3, n_neighbors=7)

print(f"KNN Accuracy for label propogation for labeled size of 200: {label_prop_accuracy2}")
print(f"KNN Accuracy for label propogation for labeled size of 400: {label_prop_accuracy3}")


  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


KNN Accuracy for label propogation for labeled size of 200: 0.8870614035087719
KNN Accuracy for label propogation for labeled size of 400: 0.8863304093567251



3-5. Let's adjust the number of labeled data samples for part 1. Consider three scenarios: one with 200 labeled samples, another with 400 labeled samples, and a third with 600 labeled samples. In each scenario, the remaining data will remain unlabeled. Additionally, include an explanation of your understanding of how these parameter changes impact the algorithm


In [14]:
# your code here
# Please write your code here. Comments are provided for guidance purposes. make adjustments as needed.
def updated_split_dataset(dataset_view1, dataset_view2, labeled_size, random_seed=42):
    num_samples = len(dataset_view1)
    indices = np.arange(num_samples)

    np.random.seed(random_seed)
    np.random.shuffle(indices)

    num_labeled = int(labeled_size)

    labeled_indices = indices[:num_labeled]
    unlabeled_indices = indices[num_labeled:]

    labeled_set_view1 = dataset_view1[labeled_indices]
    labeled_set_view2 = dataset_view2[labeled_indices]
    label_labeled_set = labels_data[labeled_indices]  

    unlabeled_set_view1 = dataset_view1[unlabeled_indices]
    unlabeled_set_view2 = dataset_view2[unlabeled_indices]

    return labeled_set_view1, labeled_set_view2, label_labeled_set, unlabeled_set_view1, unlabeled_set_view2

labeled_set_view7, labeled_set_view8, label_labeled_set4, unlabeled_set_view7, unlabeled_set_view8 = updated_split_dataset(view1_data, view2_data, labeled_size= 200)

labeled_set_view9, labeled_set_view10, label_labeled_set5, unlabeled_set_view9, unlabeled_set_view10 = updated_split_dataset(view1_data, view2_data, labeled_size= 400)

labeled_set_view11, labeled_set_view12, label_labeled_set6, unlabeled_set_view11, unlabeled_set_view12 = updated_split_dataset(view1_data, view2_data, labeled_size= 600)


In [15]:
print(labeled_set_view7.shape)
print(unlabeled_set_view7.shape)
print(labeled_set_view9.shape)
print(unlabeled_set_view10.shape)
print(labeled_set_view11.shape)
print(unlabeled_set_view12.shape)

(200, 32, 32)
(13483, 32, 32)
(400, 32, 32)
(13283, 32, 32)
(600, 32, 32)
(13083, 32, 32)



3-6. Evalute the perfomance for different number of unlabeled data size.
- Set labeled data size within the range of 100 to 130 and
- Set the unlabeled data sizes at 200, 400, and 600.
- Execute the algorithms and provide accuracy reports for both approaches: co-training and label propagation.


In [16]:
def Reupdated_split_dataset(dataset_view1, dataset_view2, labeled_size, unlabeled_size, random_seed=42):

    # First, let's create a common index for shuffling and splitting the data
    num_samples = len(dataset_view1)
    indices = np.arange(num_samples)

    # Shuffle the indices for random sampling
    np.random.seed(random_seed)
    np.random.shuffle(indices)

    # Determine the number of labeled data points
    labeled_size = int(labeled_size)

    # Calculate the number of labeled and test data points
    num_labeled = int(labeled_size)
    num_unlabeled = int(unlabeled_size)

    # Split the data into labeled, unlabeled, and test sets
    labeled_indices = indices[:num_labeled]
    unlabeled_indices = indices[num_labeled: num_labeled + num_unlabeled]
    test_indices = indices[num_unlabeled:]

    labeled_set_view1 = dataset_view1[labeled_indices]
    labeled_set_view2 = dataset_view2[labeled_indices]
    label_labeled_set = labels_data[labeled_indices]  # You need to provide the labels

    unlabeled_set_view1 = dataset_view1[unlabeled_indices]
    unlabeled_set_view2 = dataset_view2[unlabeled_indices]

    test_set_view1 = dataset_view1[test_indices]
    test_set_view2 = dataset_view2[test_indices]
    label_test_set = labels_data[test_indices] 
    
    
    return labeled_set_view1, labeled_set_view2, label_labeled_set, unlabeled_set_view1, unlabeled_set_view2, test_set_view1, test_set_view2, label_test_set


In [17]:
labeled_set_1, labeled_set_2, labeled_1, unlabeled_set_1, unlabeled_set_2, test_set_1, test_set_2, test_label_set_1 = Reupdated_split_dataset(view1_data, view2_data, labeled_size= 120, unlabeled_size= 200)

labeled_set_3, labeled_set_4, labeled_2, unlabeled_set_3, unlabeled_set_4, test_set_3, test_set_4, test_label_set_2 = Reupdated_split_dataset(view1_data, view2_data, labeled_size= 120, unlabeled_size= 400)

labeled_set_5, labeled_set_6, labeled_3, unlabeled_set_5, unlabeled_set_6, test_set_5, test_set_6, test_label_set_3 = Reupdated_split_dataset(view1_data, view2_data, labeled_size= 120, unlabeled_size= 600)

cnn_acc_1, rf_acc_1 = co_training(classifier1=cnn_classifier , classifier2=classifier_rf, 
                                        labeled_set_1=labeled_set_1, labeled_set_2=labeled_set_2,label_labeled_set=labeled_1,
                                        unlabeled_set_1=unlabeled_set_1, unlabeled_set_2=unlabeled_set_2,
                                        test_set_1=test_set_1, test_set_2=test_set_2, label_test_set=test_label_set_1,
                                        threshold_confidence= 0.90)

l_prop_accuracy_1 = label_propagation(labeled_set_1, unlabeled_set_1, labeled_1, test_set_1, test_label_set_1, n_neighbors=7)


cnn_acc_2, rf_acc_2 = co_training(classifier1=cnn_classifier , classifier2=classifier_rf, 
                                        labeled_set_1=labeled_set_3, labeled_set_2=labeled_set_4,label_labeled_set=labeled_2,
                                        unlabeled_set_1=unlabeled_set_3, unlabeled_set_2=unlabeled_set_4,
                                        test_set_1=test_set_3, test_set_2=test_set_4, label_test_set=test_label_set_2,
                                        threshold_confidence= 0.90)

l_prop_accuracy_2 = label_propagation(labeled_set_3, unlabeled_set_3, labeled_2, test_set_3, test_label_set_2, n_neighbors=7)


cnn_acc_3, rf_acc_3 = co_training(classifier1=cnn_classifier , classifier2=classifier_rf, 
                                        labeled_set_1=labeled_set_5, labeled_set_2=labeled_set_6,label_labeled_set=labeled_3,
                                        unlabeled_set_1=unlabeled_set_5, unlabeled_set_2=unlabeled_set_6,
                                        test_set_1=test_set_5, test_set_2=test_set_6, label_test_set=test_label_set_3,
                                        threshold_confidence= 0.90)

l_prop_accuracy_3 = label_propagation(labeled_set_5, unlabeled_set_5, labeled_3, test_set_5, test_label_set_3, n_neighbors=7)


print(f"Co-training of unlabeled 200 of CNN Accuracy: {cnn_acc_1} and Random Forest Accuracy: {rf_acc_1}")
print(f"KNN Accuracy for label propogation for 200 unlabeled size: {l_prop_accuracy_1}")
print(f"Co-training of unlabeled 400 of CNN Accuracy: {cnn_acc_2} and Random Forest Accuracy: {rf_acc_2}")
print(f"KNN Accuracy for label propogation for 400 unlabeled size: {l_prop_accuracy_2}")
print(f"Co-training of unlabeled 600 of CNN Accuracy: {cnn_acc_3} and Random Forest Accuracy: {rf_acc_3}")
print(f"KNN Accuracy for label propogation for 600 unlabeled size: {l_prop_accuracy_3}")


1/7 [===>..........................] - ETA: 0s

  classifier2.fit(labeled_set_2_2D, label_labeled_set)




  y = column_or_1d(y, warn=True)


 1/13 [=>............................] - ETA: 0s

  classifier2.fit(labeled_set_2_2D, label_labeled_set)




  y = column_or_1d(y, warn=True)


 1/19 [>.............................] - ETA: 0s

  classifier2.fit(labeled_set_2_2D, label_labeled_set)




  y = column_or_1d(y, warn=True)


Co-training of unlabeled 200 of CNN Accuracy: [1.7773396968841553, 0.7401171922683716] and Random Forest Accuracy: 0.8863012682637396
KNN Accuracy for label propogation for 200 unlabeled size: 0.8091670993102426
Co-training of unlabeled 400 of CNN Accuracy: [1.743417501449585, 0.7403448224067688] and Random Forest Accuracy: 0.8863208612512233
KNN Accuracy for label propogation for 400 unlabeled size: 0.8090039900624859
Co-training of unlabeled 600 of CNN Accuracy: [1.7109112739562988, 0.740273654460907] and Random Forest Accuracy: 0.8864174883436521
KNN Accuracy for label propogation for 600 unlabeled size: 0.8087594588397157


<h2>Submission</h2>

<hr style="border-top: 5px solid orange; margin-top: 1px; margin-bottom: 1px"></hr>

<p style="text-align: justify;">You need to submit a Jupyter Notebook (*.ipynb) file that contains your completed code.


<span>The file name should be in <strong>FirstName_LastName</strong> format</span>.</p>
<p style="text-align: justify;"><span>DO NOT INCLUDE EXTRA FILES, SUCH AS THE INPUT DATASETS</span>, in your submission;</p>
<p style="text-align: justify;">Please download your assignment after submission and make sure it is not corrupted or empty! We will not be responsible for corrupted submissions and will not take a resubmission after the deadline.</p>

Need Help?
If you need help with this assignment, please get in touch with TAs via their emails, or go to their office hours.
You are highly encouraged to ask your question on the designated channel for Assignment o on Microsoft Teams (not necessarily monitored by the instructor/TAs). Feel free to help other students with general questions. However, DO NOT share your solution.<hr style="border-top: 5px solid orange; margin-top: 1px; margin-bottom: 1px"></hr>