# 🎉 Out-of-Distribution (OOD) with PCA in Image Processing

The goal of this notebook is to explore in depth how Principal Component Analysis can be used to perform OOD tasks in the domain of image processing.

## 📝 Plan of action

### ♻️ Preprocessing phase

In order to achieve our goal, we need to understand how the dataset is structured.

For this notebook, we are going to use the CBIR 15 dataset, which contains images of different places, such as an office, a bedroom, a mountain, etc. Note that some places are similar to one another, e.g. a bedroom and a living room.

Thus, in order to extract the features of the images, we have to preprocess them:

- Get the images located in data/CBIR_15-scene and load them into a dataframe using Pandas
  - Locate the "Labels.txt" file: it shows where the indexes of the images from each category start
- Create the dataset from this information with two columns: the path to the image and its category
- Resize all of the images to the same size (in this case, we are going with 256x256)
  
Now, in order to extract the features, it's necessary to divide the resized images into patches of 32x32 pixels. Working on small patches keeps the processing tasks fast and avoids long waiting times.
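
As a rough illustration of the patching step, here is a minimal sketch of how a 256x256 image can be cut into non-overlapping 32x32 patches. The helper name `split_into_patches` and the synthetic `demo_image` are assumptions made for this example; the notebook's own `create_patches` lives in `image_patching.py` and may work differently.

```python
import numpy as np

def split_into_patches(image, patch_size=(32, 32)):
    """Split an image array into non-overlapping patches.

    Hypothetical helper for illustration only; it is not the
    create_patches function from image_patching.py.
    """
    patch_h, patch_w = patch_size
    patches, positions = [], []
    # Step through the image in strides of one patch, row by row
    for y in range(0, image.shape[0] - patch_h + 1, patch_h):
        for x in range(0, image.shape[1] - patch_w + 1, patch_w):
            patches.append(image[y:y + patch_h, x:x + patch_w])
            positions.append((y, x))  # top-left corner of the patch
    return patches, positions

# Synthetic 256x256 grayscale image standing in for a resized dataset image
demo_image = np.arange(256 * 256, dtype=np.float32).reshape(256, 256)
demo_patches, demo_positions = split_into_patches(demo_image)
print(len(demo_patches), demo_patches[0].shape)
```

With these sizes, each image yields 8 x 8 = 64 patches, and each flattened patch has 32 x 32 = 1024 values.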

After all the preprocessing, we separate the images into two different folders: one contains the patches of the training images, which give us the principal components and dimensions, and the other contains the patches of the test images, which are projected onto those dimensions to obtain an OOD score afterwards.

### 🏋🏽‍♂️ Training phase

With the images that are stored inside the "patches_train" folder, the first thing we do is _normalize_ all of the patches, so that every variable is on the same scale and PCA finds the directions of maximum variance correctly.

Next, we apply PCA with all the components. Since the patches are 32x32, we have 1024 features and hence up to 1024 components. We then plot a graph to see how many components actually account for most of the variance in the data. In this notebook we use a threshold of 95% of the variance.

After fitting the PCA with the components that describe 95% of the variance, it's time to test our images and see how much of their data falls in the residual space.

### ⚗️ Test phase and results

In this phase, we take the test images and normalize them on the same scale used for each PCA. This keeps the final results consistent and lets us measure the norms in the new space properly.

After that, we calculate the norm of the projection of the given data onto the orthogonal complement of the principal components (the residual space) and divide it by the norm of the data itself. This is the OOD score.

We calculate the mean score for each category and pick the smallest one: the category with the lowest mean score is taken as the current environment.
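
Written out as a formula, the score can be sketched as follows. The notation is an assumption of this note, not something defined elsewhere in the notebook: $x$ is a centered, flattened patch, the rows of $W_k$ are the $k$ retained principal components, and the PCA mean is taken to be zero (which holds when the training data is centered).

$$
\hat{x} = W_k^\top W_k\, x, \qquad r = x - \hat{x}, \qquad \mathrm{score}(x) = \frac{\lVert r \rVert_2}{\lVert x \rVert_2}
$$

Because the rows of $W_k$ are orthonormal, $r$ is orthogonal to $\hat{x}$ and the score lies between 0 and 1: it is 0 when the patch lies entirely in the PCA subspace and approaches 1 when almost none of the patch is captured by the retained components.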


--------------------------

First of all, we need to understand which libraries we are going to use:

- os: Deals with the operating system interface, for example finding the relative and absolute paths of files inside a project and reading/writing files.
- sys: This module provides access to some variables used or maintained by the interpreter and to functions that interact strongly with the interpreter.
- numpy: NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.
- pandas: Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
- matplotlib: Deals with plotting graphs to visualize data in a graphical way.
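- sklearn: Scikit-learn provides dozens of built-in machine learning algorithms and models, called estimators.
- cv2: OpenCV's Python bindings, used here to read and write images and to convert between color spaces.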

In [None]:
import os
import sys
import cv2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA


I'd suggest using a conda virtual environment to avoid messing up your base kernel environment and causing dependency errors in the future.

After you have successfully installed all the modules, it's time to import our custom modules, which deal with:

- Creation of our dataframe using pandas
- Separation of our dataset into patches of 32x32 in folders of training and test

In [None]:

sys.path.append(os.path.abspath('..'))

from dataframe_generator import *
from image_patching import *

In [None]:
import tarfile

def extract_tgz(tgz_path, extract_to):
    if not os.path.exists(extract_to):
        os.makedirs(extract_to)
    
    with tarfile.open(tgz_path, 'r:gz') as tar:
        tar.extractall(path=extract_to)
        print(f"Files extracted to {extract_to}")

tgz_path = '../CBIR_15-Scene.tgz'
extract_to = '../data/'

extract_tgz(tgz_path, extract_to)

In [None]:
df = create_dataframe()
patch_size = (32,32)
standard_size = (256, 256)
df

## ☝️ Part I: Comparing two different environments

### ♻️ Preprocessing phase

Now we start our experiments to see whether our idea works. This time, we look at what happens with our approach when using two different environments.

In our case, I'm going to pick the **Coast** and **Office** environments arbitrarily.


In [None]:
train_categories = ['Coast', 'Office']

df_different = df[df['category'].isin(train_categories)]
df_different

It's time to split our dataset into train and test sets. We use sklearn's built-in function to do this:

In [None]:
X = df_different['image_path']
y = df_different['category']
(X_train, X_test, y_train, y_test) = train_test_split(X, y, test_size=0.2, random_state=10)


To make sure that everything went well, we plot a grid with all the patches from the first image of our training set:

In [None]:
def plot_patches_grid(patches, patch_size, grid_shape):
    fig, axes = plt.subplots(grid_shape[0], grid_shape[1], figsize=(10, 10))
    patch_idx = 0
    for i in range(grid_shape[0]):
        for j in range(grid_shape[1]):
            if patch_idx < len(patches):
                axes[i, j].imshow(cv2.cvtColor(patches[patch_idx], cv2.COLOR_BGR2RGB))
                axes[i, j].axis('off')
                patch_idx += 1
            else:
                axes[i, j].axis('off')
    plt.tight_layout()
    plt.show()

patch_size = (32,32)
standard_size = (256, 256)

first_image_path = X_train.iloc[0]
image = cv2.imread(first_image_path)
resized_image = resize_image(image, standard_size)
patches, positions = create_patches(resized_image, patch_size)

grid_rows = resized_image.shape[0] // patch_size[0]
grid_cols = resized_image.shape[1] // patch_size[1]

plot_patches_grid(patches, patch_size, (grid_rows, grid_cols))


This is exactly what the functions inside our "image_patching.py" module do. So now we need to save everything into the subfolders by calling that function:

In [None]:
create_images_set(X_train, X_test, y_train, y_test, patch_size, output_dir_train='patches_train', output_dir_test='patches_test')

In [None]:
def load_patches_by_category(base_dir, categories):
    patches_by_category = {}
    
    for category in categories:
        category_patches = {}
        category_dir = os.path.join(base_dir, str(category))
        
        for root, _, files in os.walk(category_dir):
            files = [f for f in files if f.endswith('.png') and '_patch_' in f]
            files = sorted(files, key=lambda x: (int(x.split('_')[1]), int(x.split('_')[3]), int(x.split('_')[4].split('.')[0])))

            for filename in files:
                try:
                    parts = filename.split('_')
                    image_id = int(parts[1])
                    y = int(parts[3])
                    x = int(parts[4].split('.')[0])
                    patch = cv2.imread(os.path.join(root, filename), cv2.IMREAD_GRAYSCALE)
                    if patch is not None:
                        if image_id not in category_patches:
                            category_patches[image_id] = ([], [])
                        category_patches[image_id][0].append(patch.flatten())
                        category_patches[image_id][1].append((y, x))
                except (IndexError, ValueError) as e:
                    print(f"Error processing file {filename}: {e}")
                    continue

        patches_by_category[category] = category_patches
    
    return patches_by_category
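
The loader above relies on a specific file-name layout. Judging from the indices it reads (`parts[1]`, `parts[3]`, `parts[4]`) and the `'_patch_'` check, names appear to follow a pattern like `<prefix>_<image_id>_patch_<y>_<x>.png`. The sketch below isolates that parsing step; the `image` prefix in the sample names and the helper `parse_patch_filename` are assumptions for illustration, since the actual names are produced by `create_images_set`.

```python
def parse_patch_filename(filename):
    """Return (image_id, y, x) from a name such as 'image_12_patch_32_64.png'.

    Hypothetical helper mirroring the indexing used in
    load_patches_by_category; the 'image' prefix is an assumption.
    """
    parts = filename.split('_')
    image_id = int(parts[1])
    y = int(parts[3])
    x = int(parts[4].split('.')[0])  # drop the '.png' extension
    return image_id, y, x

names = ['image_7_patch_32_0.png', 'image_7_patch_0_32.png', 'image_2_patch_0_0.png']
# Sorting by (image_id, y, x) groups patches per image in row-major order
ordered = sorted(names, key=parse_patch_filename)
print(ordered)
```

Sorting numerically matters here: a plain string sort would place `image_10` before `image_2`.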


Now, we should load our patches for training:

In [None]:
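# Pass the list of category names; passing the label Series y would reload each category once per image
training_patches_by_category = load_patches_by_category('patches_train', train_categories)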

### 🏋🏽‍♂️ Training phase

Now that we have our training patches stored in the variable above, we can start our analysis with PCA.

First of all, we **need to normalize and center** the data. It's so important that I had to emphasize it. Plus, since we are dealing with different categories, each one of them should be normalized with a different scaler (and we're going to save it for later). The cell below implements the centering part: for each image, it subtracts the per-pixel mean computed over that image's patches.

In [None]:
def center_patches(patches):
    return patches - patches.mean(axis=0)

centered_training_patches_by_category = {}
for category, images in training_patches_by_category.items():
    centered_images = {}
    for image_id, (patches, positions) in images.items():
        centered_patches = center_patches(np.array(patches))
        centered_images[image_id] = (centered_patches, positions)
    centered_training_patches_by_category[category] = centered_images

print(centered_training_patches_by_category['Office'][list(centered_training_patches_by_category['Office'].keys())[0]][0].shape)
print(centered_training_patches_by_category['Coast'][list(centered_training_patches_by_category['Coast'].keys())[0]][0].shape)
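
To make the effect of `center_patches` concrete, here is a small sketch on synthetic data. It assumes an image with 64 flattened patches of 1024 values each; the random array `demo_patches` is a stand-in, not real dataset content. Note that `mean(axis=0)` averages over the patches of one image, so the result has zero mean per pixel position rather than zero mean per patch.

```python
import numpy as np

rng = np.random.default_rng(0)
# 64 flattened 32x32 patches with pixel values in [0, 255]
demo_patches = rng.integers(0, 256, size=(64, 1024)).astype(np.float64)

# Same operation as center_patches: subtract the per-pixel mean over the patches
centered = demo_patches - demo_patches.mean(axis=0)

print(centered.shape)
# Every pixel position now averages to (numerically) zero across the patches
print(np.allclose(centered.mean(axis=0), 0.0))
```

After centering, values can be negative, which is why the histograms of centered data are spread around zero instead of covering the 0 to 255 range.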

In [None]:
for category, images in centered_training_patches_by_category.items():
    sorted_ids = sorted(images.keys())
    print(f"Sorted Image IDs for category {category}: {sorted_ids}")

In [None]:
import os
import numpy as np
import cv2

def reassemble_image_from_patches(patches, positions, original_image_shape, patch_size):
    reconstructed_image = np.zeros(original_image_shape, dtype=np.float32)
    patch_height, patch_width = patch_size

    for patch, (i, j) in zip(patches, positions):
        if i + patch_height <= original_image_shape[0] and j + patch_width <= original_image_shape[1]:
            patch = patch.reshape((patch_height, patch_width))
            reconstructed_image[i:i + patch_height, j:j + patch_width] = patch

    return reconstructed_image

def save_reconstructed_image(reconstructed_image, save_path):
    reconstructed_image_uint8 = np.clip(reconstructed_image, 0, 255).astype(np.uint8)  # this loses information
    cv2.imwrite(save_path, reconstructed_image_uint8)

categories = ["Coast", "Office"]
base_input_dir = "patches_train"
base_output_dir = "different_original_images_raw"

patch_size = (32, 32)
original_image_shape = (256, 256)

training_patches_by_category = load_patches_by_category(base_input_dir, categories)

for category, image_patches in training_patches_by_category.items():
    for image_id, (patches, positions) in image_patches.items():
        centered_patches = center_patches(np.array(patches))
        reconstructed_image = reassemble_image_from_patches(centered_patches, positions, original_image_shape, patch_size)
        output_dir = os.path.join(base_output_dir, category)
        os.makedirs(output_dir, exist_ok=True)
        output_path = os.path.join(output_dir, f"reconstructed_image_{image_id}.png")
        save_reconstructed_image(reconstructed_image, output_path)


We have 214 training images for 'Office' with 64 patches each, so we should have 13696 patches in total.
Similarly, we have 356 training images for 'Coast', so we should have 22784 patches in total.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

def plot_distribution(patches, title):
    if isinstance(patches, list):
        patches = np.array(patches)
    plt.figure(figsize=(10, 6))
    plt.hist(patches.flatten(), bins=50, alpha=0.75)
    plt.title(title)
    plt.xlabel('Patch value')
    plt.ylabel('Frequency')
    plt.grid(True)
    plt.show()

for category, images in training_patches_by_category.items():
    count = 0
    for image_id, (patches, positions) in images.items():
        if count >= 3:
            break
        plot_distribution(patches, f'Original data distribution - {category} (Image ID: {image_id})')
        centered_patches = centered_training_patches_by_category[category][image_id][0]
        plot_distribution(centered_patches, f'Centered data distribution - {category} (Image ID: {image_id})')
        count += 1


Analysing the distributions above, we see that the centering step has shifted our data so that it is distributed around zero.

Now let's fit the PCA on the patches. Since each patch has 32x32 (1024) pixels, we take 1024 as the initial number of components.

In [None]:
def apply_all_components_pca(patches_by_category, n_components=1024):
    all_components_pca_by_category = {}
    
    for category, images in patches_by_category.items():
        all_patches = []
        for image_id, (patches, positions) in images.items():
            all_patches.append(patches)
        all_patches = np.vstack(all_patches) 

        if all_patches.size == 0:
            continue
        
        pca = PCA(n_components=n_components)
        pca.fit(all_patches)
        all_components_pca_by_category[category] = pca
    return all_components_pca_by_category

In [None]:
def apply_reduced_pca(patches_by_category, n_components=1024, number_variance=0.95):
    pca_by_category = {}
    num_components_reduced_dict = {}
    
    for category, images in patches_by_category.items():
        all_patches = []
        for image_id, (patches, positions) in images.items():
            all_patches.append(patches)
        all_patches = np.vstack(all_patches)  
        if all_patches.size == 0:
            continue
        
        pca = PCA(n_components=n_components)
        pca.fit(all_patches)
        
        cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
        num_components_reduced = np.where(cumulative_variance >= number_variance)[0][0] + 1
        
        pca = PCA(n_components=num_components_reduced)
        pca.fit(all_patches)
        
        pca_by_category[category] = pca
        num_components_reduced_dict[category] = num_components_reduced

    min_num_components = min(num_components_reduced_dict.values())
    return pca_by_category, num_components_reduced_dict, min_num_components
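
As a side note, scikit-learn can pick the number of components for a target variance directly when `n_components` is a float between 0 and 1 and `svd_solver='full'`. The sketch below compares that shortcut with the manual cumulative-sum approach used in `apply_reduced_pca`, on synthetic data (the arrays `latent`, `mixing` and `data` are made up for this example and stand in for the patch matrix).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 500 samples, 50 features, mostly explained by 5 latent factors
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 50))
data = latent @ mixing + 0.1 * rng.normal(size=(500, 50))

# Shortcut: let scikit-learn choose the components that reach 95% variance
pca_auto = PCA(n_components=0.95, svd_solver='full').fit(data)

# Manual approach, as in apply_reduced_pca: fit everything, then threshold
pca_full = PCA().fit(data)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
k_manual = int(np.argmax(cumulative >= 0.95)) + 1

print(pca_auto.n_components_, k_manual)
```

Both routes should agree on the number of components, and the shortcut avoids fitting the PCA twice.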

In [None]:
def visualize_pca_components(pca_by_category, num_components_dict):
    for category, pca in pca_by_category.items():
        num_components = num_components_dict[category]
        components = pca.components_

        n_rows = (num_components // 20) + (1 if num_components % 20 != 0 else 0)
        n_cols = min(num_components, 20)

        fig, axes = plt.subplots(n_rows, n_cols, figsize=(20, n_rows * 0.8))
        for i, ax in enumerate(axes.flatten()):
            if i < num_components:
                ax.imshow(components[i].reshape(32, 32), cmap='gray')
            ax.axis('off')
        plt.suptitle(f'{num_components} Principal Components - Category: {category}')
        plt.tight_layout(rect=[0, 0, 1, 0.96])
        plt.show()

        plt.figure(figsize=(10, 6))
        plt.plot(np.log(pca.explained_variance_[:num_components]))
        plt.title(f'Log-Variance of Principal Components - Category: {category}')
        plt.xlabel('Index of the Principal Component')
        plt.ylabel('Log-Variance')
        plt.grid(True)
        plt.show()

        print("Category: " + category)
        print(f"Number of components: {num_components}")

In [None]:
all_components_pca_by_category = apply_all_components_pca(centered_training_patches_by_category)
num_components_all_dict = {category: 1024 for category in centered_training_patches_by_category}
visualize_pca_components(all_components_pca_by_category, num_components_all_dict)


In [None]:
reduced_components_pca_by_category_95, num_components_reduced_dict_95, _ = apply_reduced_pca(centered_training_patches_by_category, number_variance=0.95)
reduced_components_pca_by_category_90, num_components_reduced_dict_90, _ = apply_reduced_pca(centered_training_patches_by_category, number_variance=0.90)
reduced_components_pca_by_category_85, num_components_reduced_dict_85, _ = apply_reduced_pca(centered_training_patches_by_category, number_variance=0.85)
reduced_components_pca_by_category_80, num_components_reduced_dict_80, _ = apply_reduced_pca(centered_training_patches_by_category, number_variance=0.80)

In [None]:
def print_num_components(num_components_reduced_dict, variance_level):
    print(f"\nNumber of Components for {variance_level*100}% Variance Explained:")
    for category, num_components in num_components_reduced_dict.items():
        print(f"Category: {category}, Number of Components: {num_components}")

# Print the number of components for each explained-variance level
print_num_components(num_components_reduced_dict_95, 0.95)
print_num_components(num_components_reduced_dict_90, 0.90)
print_num_components(num_components_reduced_dict_85, 0.85)
print_num_components(num_components_reduced_dict_80, 0.80)

So now we see that for "Coast" we only need 23 of the 1024 components to explain 95% of the variance in the patches. For "Office", the number drops to 22.

The variable that stores the PCA for each category holds the PCA fitted with this reduced number of components.

In [None]:
def project_and_reconstruct_patches(pca_by_category, centered_patches_by_category):
    reconstructed_patches_by_category = {}

    for category, pca in pca_by_category.items():
        images = centered_patches_by_category[category]
        reconstructed_images = {}
        
        for image_id, (centered_patches, positions) in images.items():
            projected = pca.transform(centered_patches)
            reconstructed_patches = pca.inverse_transform(projected)
            reconstructed_images[image_id] = (reconstructed_patches, positions)
        
        reconstructed_patches_by_category[category] = reconstructed_images
    
    return reconstructed_patches_by_category

all_components_pca_by_category = apply_all_components_pca(centered_training_patches_by_category)

reconstructed_patches_by_category = project_and_reconstruct_patches(all_components_pca_by_category, centered_training_patches_by_category)


In [None]:
base_output_dir = "different_all_components_pca_images_reconstructed"
patch_size = (32, 32)
original_image_shape = (256, 256)

for category, image_patches in reconstructed_patches_by_category.items():
    for image_id, (patches, positions) in image_patches.items():
        reconstructed_image = reassemble_image_from_patches(patches, positions, original_image_shape, patch_size)
        
        output_dir = os.path.join(base_output_dir, category)
        os.makedirs(output_dir, exist_ok=True)
        output_path = os.path.join(output_dir, f"reconstructed_image_{image_id}.png")
        save_reconstructed_image(reconstructed_image, output_path)


In [None]:
reduced_reconstructed_patches_by_category = project_and_reconstruct_patches(reduced_components_pca_by_category_90, centered_training_patches_by_category)

In [None]:
base_output_dir = "different_reduced_components_pca_images_reconstructed"
patch_size = (32, 32)
original_image_shape = (256, 256)

for category, image_patches in reduced_reconstructed_patches_by_category.items():
    for image_id, (patches, positions) in image_patches.items():
        reconstructed_image = reassemble_image_from_patches(patches, positions, original_image_shape, patch_size)
        
        output_dir = os.path.join(base_output_dir, category)
        os.makedirs(output_dir, exist_ok=True)
        output_path = os.path.join(output_dir, f"reconstructed_image_{image_id}.png")
        save_reconstructed_image(reconstructed_image, output_path)


We can also check this visually: the reconstructed images saved above should keep the main features of the original images recognizable.

### ⚗️ Test phase

Now we start our experimenting phase.
We begin by loading the test dataset.

In [None]:
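# Pass the list of category names; passing the label Series y would reload each category once per image
test_patches_by_category = load_patches_by_category('patches_test', train_categories)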

In [None]:
import matplotlib.pyplot as plt
import math

# Choose the category
category = 'Coast'

# Check that the category exists in the loaded patches
if category in test_patches_by_category:
    # Get the patches for this category
    category_patches = test_patches_by_category[category]
    
    # Get the first image_id and its corresponding patches
    if category_patches:
        first_image_id = next(iter(category_patches))
        patches, _ = category_patches[first_image_id]
        
        # Check that there are patches
        if patches:
            # Define the grid size
            num_patches = len(patches)
            grid_size = int(math.ceil(math.sqrt(num_patches)))
            # Assumes square patches; named patch_side so the global patch_size tuple is not overwritten
            patch_side = int(len(patches[0]) ** 0.5)
            
            # Create a figure with subplots for the grid
            fig, axes = plt.subplots(grid_size, grid_size, figsize=(10, 10))
            fig.suptitle(f'Patches of the first image in category: {category}')
            
            # Fill the grid with the patches
            for i in range(grid_size):
                for j in range(grid_size):
                    patch_idx = i * grid_size + j
                    if patch_idx < num_patches:
                        patch = patches[patch_idx]
                        axes[i, j].imshow(patch.reshape(patch_side, patch_side), cmap='gray')
                    axes[i, j].axis('off')
            
            plt.tight_layout()
            plt.show()
        else:
            print("No patches found in the first image.")
    else:
        print("No images found in this category.")
else:
    print(f"Category '{category}' not found.")


In [None]:
centered_test_patches_by_category = {}

for category, images in test_patches_by_category.items():
    centered_images = {}
    for image_id, (patches, positions) in images.items():
        centered_patches = center_patches(np.array(patches))
        centered_images[image_id] = (centered_patches, positions)
    
    centered_test_patches_by_category[category] = centered_images

print(centered_test_patches_by_category['Office'][list(centered_test_patches_by_category['Office'].keys())[0]][0].shape)
print(centered_test_patches_by_category['Coast'][list(centered_test_patches_by_category['Coast'].keys())[0]][0].shape)


In [None]:
import matplotlib.pyplot as plt
import math

# Choose the category
category = 'Coast'

# Check that the category exists in the centered patches
if category in centered_test_patches_by_category:
    # Get the patches for this category
    category_patches = centered_test_patches_by_category[category]
    
    # Get the first image_id and its corresponding patches
    if category_patches:
        first_image_id = next(iter(category_patches))
        patches, _ = category_patches[first_image_id]
        
        # Check that there are patches (len() also works if all centered values are zero)
        if len(patches) > 0:
            # Define the grid size
            num_patches = len(patches)
            grid_size = int(math.ceil(math.sqrt(num_patches)))
            # Assumes square patches; named patch_side so the global patch_size tuple is not overwritten
            patch_side = int(len(patches[0]) ** 0.5)
            
            # Create a figure with subplots for the grid
            fig, axes = plt.subplots(grid_size, grid_size, figsize=(10, 10))
            fig.suptitle(f'Patches of the first image in category: {category}')
            
            # Fill the grid with the patches
            for i in range(grid_size):
                for j in range(grid_size):
                    patch_idx = i * grid_size + j
                    if patch_idx < num_patches:
                        patch = patches[patch_idx]
                        axes[i, j].imshow(patch.reshape(patch_side, patch_side), cmap='gray')
                    axes[i, j].axis('off')
            
            plt.tight_layout()
            plt.show()
        else:
            print("No patches found in the first image.")
    else:
        print("No images found in this category.")
else:
    print(f"Category '{category}' not found.")


In [None]:
for category, images in test_patches_by_category.items():
    count = 0
    for image_id, (patches, positions) in images.items():
        if count >= 3:
            break
        plot_distribution(patches, f'Original data distribution - {category} (Image ID: {image_id})')
        centered_patches = centered_test_patches_by_category[category][image_id][0]
        plot_distribution(centered_patches, f'Centered data distribution - {category} (Image ID: {image_id})')
        count += 1

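Now we compute the projection of the given data onto the residual space. The main logic here is to divide the norm of each patch's residual (the part that the PCA reconstruction does not capture) by the norm of the patch itself: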

In [None]:
def calculate_ood(patches, residuals):
    residual_norms = np.linalg.norm(residuals, axis=1)
    original_norms = np.linalg.norm(patches, axis=1)
    ood_scores = residual_norms / original_norms
    return ood_scores
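
To see how this ratio behaves, here is a small self-contained sketch on synthetic data. It assumes in-distribution samples that lie on a 2-D plane inside a 10-D space, and out-of-distribution samples drawn from the full space; the helper `residual_ratio` is hypothetical and simply combines the residual computation with the same ratio as `calculate_ood`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# "In-distribution" training data: points on a 2-D plane embedded in 10-D
basis = rng.normal(size=(2, 10))
train = rng.normal(size=(300, 2)) @ basis
train = train - train.mean(axis=0)  # center, as the notebook does

pca = PCA(n_components=2).fit(train)

def residual_ratio(pca, samples):
    """Norm of the PCA residual divided by the norm of the sample (hypothetical helper)."""
    reconstructed = pca.inverse_transform(pca.transform(samples))
    residuals = samples - reconstructed
    return np.linalg.norm(residuals, axis=1) / np.linalg.norm(samples, axis=1)

in_dist = rng.normal(size=(50, 2)) @ basis   # same plane as the training data
out_dist = rng.normal(size=(50, 10))         # spread over all 10 dimensions

print(residual_ratio(pca, in_dist).mean())   # close to 0
print(residual_ratio(pca, out_dist).mean())  # much closer to 1
```

One caveat for real patches: a patch whose norm is exactly zero would cause a division by zero, so it may be worth masking such patches before dividing.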

In [None]:
for category, pca in reduced_components_pca_by_category_90.items():
    print(f"Category: {category}, Number of components: {pca.n_components_}")
    print(f"Explained variance by components: {pca.explained_variance_ratio_}")


In [None]:
def save_residual_images_as_full_image(residuals_by_category, output_dir, patch_size=(32, 32), original_image_shape=(256, 256)):
    for category, images in residuals_by_category.items():
        category_dir = os.path.join(output_dir, category)
        os.makedirs(category_dir, exist_ok=True)
        
        for image_id, (residuals, positions) in images.items():
            residual_image = reassemble_image_from_patches(residuals, positions, original_image_shape, patch_size)
            
            residual_image = np.clip(residual_image, 0, 255)
            residual_image = residual_image.astype(np.uint8)
            
            residual_image_filename = f"residual_image_{image_id}.png"
            residual_image_path = os.path.join(category_dir, residual_image_filename)
            cv2.imwrite(residual_image_path, residual_image)

residuals_output_dir = "different_reduced_components_pca_images_residuals"

def calculate_residuals_with_pca(pca_by_category, patches_by_category):
    residuals_by_category = {}
    for category, images in patches_by_category.items():
        residuals_images = {}
        for image_id, (patches, positions) in images.items():
            pca = pca_by_category[category]
            projected_data = pca.transform(patches)
            reconstructed_data = pca.inverse_transform(projected_data)
            residuals = patches - reconstructed_data
            residuals_images[image_id] = (residuals, positions)
        residuals_by_category[category] = residuals_images
    return residuals_by_category

all_components_residuals_by_category = calculate_residuals_with_pca(all_components_pca_by_category, centered_test_patches_by_category)

reduced_components_residuals_by_category_95 = calculate_residuals_with_pca(reduced_components_pca_by_category_95, centered_test_patches_by_category)
reduced_components_residuals_by_category_90 = calculate_residuals_with_pca(reduced_components_pca_by_category_90, centered_test_patches_by_category)
reduced_components_residuals_by_category_85 = calculate_residuals_with_pca(reduced_components_pca_by_category_85, centered_test_patches_by_category)
reduced_components_residuals_by_category_80 = calculate_residuals_with_pca(reduced_components_pca_by_category_80, centered_test_patches_by_category)

save_residual_images_as_full_image(reduced_components_residuals_by_category_90, residuals_output_dir)


In [None]:
def calculate_ood_scores(residuals, original_patches):
    if residuals.shape != original_patches.shape:
        print(f"Shape mismatch in calculate_ood_scores: residuals={residuals.shape}, original_patches={original_patches.shape}")
        return float('nan')
    
    scores = calculate_ood(original_patches, residuals)
    ood_score = np.mean(scores)
    return ood_score

def process_and_calculate_ood(residuals_list, original_patches_list):
    total_ood_scores = []
    
    # Iterate through residuals and original patches without concatenation
    for (residuals, _), (original_patches, _) in zip(residuals_list, original_patches_list):
        if residuals.shape[0] != original_patches.shape[0]:
            print(f"Mismatch in number of patches: residuals={residuals.shape[0]}, original_patches={original_patches.shape[0]}")
            continue
        
        # Calculate OOD score for each pair of residuals and original patches
        ood_score = calculate_ood_scores(residuals, original_patches)
        
        if not np.isnan(ood_score):  # Ignore NaN scores
            total_ood_scores.append(ood_score)
    
    if not total_ood_scores:
        print("No data available for calculation.")
        return float('nan')
    
    # Return the mean OOD score across all images
    return np.mean(total_ood_scores)


coast_residuals = list(all_components_residuals_by_category['Coast'].values())
office_residuals = list(all_components_residuals_by_category['Office'].values())

coast_original_patches = list(centered_test_patches_by_category['Coast'].values())
office_original_patches = list(centered_test_patches_by_category['Office'].values())

ood_score_coast_train_coast_test = process_and_calculate_ood(coast_residuals, coast_original_patches)
ood_score_coast_train_office_test = process_and_calculate_ood(coast_residuals, office_original_patches)
ood_score_office_train_office_test = process_and_calculate_ood(office_residuals, office_original_patches)
ood_score_office_train_coast_test = process_and_calculate_ood(office_residuals, coast_original_patches)

print("OOD scores using Coast training data on Coast test data:")
print(f"OOD Score: {ood_score_coast_train_coast_test}")

print("\nOOD scores using Coast training data on Office test data:")
print(f"OOD Score: {ood_score_coast_train_office_test}")

print("\nOOD scores using Office training data on Office test data:")
print(f"OOD Score: {ood_score_office_train_office_test}")

print("\nOOD scores using Office training data on Coast test data:")
print(f"OOD Score: {ood_score_office_train_coast_test}")
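
In the cell above, the cross-category scores pair residuals that were computed with each test category's own PCA with the patches of the other category. Another way to read "train on A, test on B" is to project B's test patches onto A's PCA and measure the residual there, which may be worth comparing. Below is a minimal sketch of that variant. The helper `cross_category_score` is hypothetical, and the synthetic dictionaries `test_a` and `test_b` only mimic the `image_id -> (patches, positions)` layout used in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

def cross_category_score(pca_train, test_images):
    """Mean OOD score of test_images under a PCA fitted on another category.

    Hypothetical helper, not part of the original notebook code.
    test_images maps image_id -> (patches, positions).
    """
    per_image = []
    for patches, _positions in test_images.values():
        patches = np.asarray(patches, dtype=np.float64)
        reconstructed = pca_train.inverse_transform(pca_train.transform(patches))
        residual_norms = np.linalg.norm(patches - reconstructed, axis=1)
        patch_norms = np.linalg.norm(patches, axis=1)
        valid = patch_norms > 0  # skip all-zero patches to avoid dividing by zero
        if valid.any():
            per_image.append(np.mean(residual_norms[valid] / patch_norms[valid]))
    return float(np.mean(per_image)) if per_image else float('nan')

rng = np.random.default_rng(1)
basis_a = rng.normal(size=(3, 16))  # subspace standing in for category A
basis_b = rng.normal(size=(3, 16))  # a different subspace for category B
pca_a = PCA(n_components=3).fit(rng.normal(size=(200, 3)) @ basis_a)

test_a = {0: (rng.normal(size=(8, 3)) @ basis_a, [(0, 0)] * 8)}
test_b = {0: (rng.normal(size=(8, 3)) @ basis_b, [(0, 0)] * 8)}

score_a_on_a = cross_category_score(pca_a, test_a)
score_a_on_b = cross_category_score(pca_a, test_b)
print(score_a_on_a < score_a_on_b)
```

With the notebook's data, the equivalent call would pass, for example, the Coast PCA together with `centered_test_patches_by_category['Office']`.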


In [None]:
ood_scores_all_levels = {
    0.95: {
        'coast_train_coast_test': process_and_calculate_ood(list(reduced_components_residuals_by_category_95['Coast'].values()), list(centered_test_patches_by_category['Coast'].values())),
        'coast_train_office_test': process_and_calculate_ood(list(reduced_components_residuals_by_category_95['Coast'].values()), list(centered_test_patches_by_category['Office'].values())),
        'office_train_office_test': process_and_calculate_ood(list(reduced_components_residuals_by_category_95['Office'].values()), list(centered_test_patches_by_category['Office'].values())),
        'office_train_coast_test': process_and_calculate_ood(list(reduced_components_residuals_by_category_95['Office'].values()), list(centered_test_patches_by_category['Coast'].values()))
    },
    0.9: {
        'coast_train_coast_test': process_and_calculate_ood(list(reduced_components_residuals_by_category_90['Coast'].values()), list(centered_test_patches_by_category['Coast'].values())),
        'coast_train_office_test': process_and_calculate_ood(list(reduced_components_residuals_by_category_90['Coast'].values()), list(centered_test_patches_by_category['Office'].values())),
        'office_train_office_test': process_and_calculate_ood(list(reduced_components_residuals_by_category_90['Office'].values()), list(centered_test_patches_by_category['Office'].values())),
        'office_train_coast_test': process_and_calculate_ood(list(reduced_components_residuals_by_category_90['Office'].values()), list(centered_test_patches_by_category['Coast'].values()))
    },
    0.85: {
        'coast_train_coast_test': process_and_calculate_ood(list(reduced_components_residuals_by_category_85['Coast'].values()), list(centered_test_patches_by_category['Coast'].values())),
        'coast_train_office_test': process_and_calculate_ood(list(reduced_components_residuals_by_category_85['Coast'].values()), list(centered_test_patches_by_category['Office'].values())),
        'office_train_office_test': process_and_calculate_ood(list(reduced_components_residuals_by_category_85['Office'].values()), list(centered_test_patches_by_category['Office'].values())),
        'office_train_coast_test': process_and_calculate_ood(list(reduced_components_residuals_by_category_85['Office'].values()), list(centered_test_patches_by_category['Coast'].values()))
    },
    0.8: {
        'coast_train_coast_test': process_and_calculate_ood(list(reduced_components_residuals_by_category_80['Coast'].values()), list(centered_test_patches_by_category['Coast'].values())),
        'coast_train_office_test': process_and_calculate_ood(list(reduced_components_residuals_by_category_80['Coast'].values()), list(centered_test_patches_by_category['Office'].values())),
        'office_train_office_test': process_and_calculate_ood(list(reduced_components_residuals_by_category_80['Office'].values()), list(centered_test_patches_by_category['Office'].values())),
        'office_train_coast_test': process_and_calculate_ood(list(reduced_components_residuals_by_category_80['Office'].values()), list(centered_test_patches_by_category['Coast'].values()))
    }
}

for variance, scores in ood_scores_all_levels.items():
    print(f"\nVariance level: {variance}")
    for test_type, score in scores.items():
        print(f"{test_type}: OOD Score: {score}")

## Agnostic Spaces
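
The cells in this section restrict the PCA basis to a subset of components and score each patch by how much of it that subset fails to reconstruct. As a minimal sketch of that idea on synthetic data: the array `X`, the helper `relative_residual`, and the SVD-based PCA are illustrative assumptions, not the notebook's own objects.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for centered patches: 200 samples, 16 features,
# with decreasing spread along successive axes.
X = rng.normal(size=(200, 16)) * np.linspace(4.0, 0.5, 16)
X = X - X.mean(axis=0)

# PCA via SVD: the rows of Vt are orthonormal principal directions.
_, _, Vt = np.linalg.svd(X, full_matrices=False)

def relative_residual(data, components, indices):
    """Mean of ||x - x_hat|| / ||x||, where x_hat uses only the rows in `indices`."""
    basis = components[indices]                  # shape (k, n_features)
    reconstructed = (data @ basis.T) @ basis     # project, then map back
    residual_norms = np.linalg.norm(data - reconstructed, axis=1)
    original_norms = np.linalg.norm(data, axis=1)
    return float(np.mean(residual_norms / original_norms))

score_few = relative_residual(X, Vt, [0, 1])
score_many = relative_residual(X, Vt, list(range(8)))
score_all = relative_residual(X, Vt, list(range(16)))
```

Because the basis rows are orthonormal, adding components can only shrink the residual, so the score falls toward zero as the subset grows to the full basis.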

In [None]:
pca_results = {}

# Explained-variance percentages to collect
percentages = [95, 90, 85, 80]

# Assumes the variables 'reduced_components_pca_by_category_X' exist (X being the percentage)
# and hold the fitted PCA objects for each category
for perc in percentages:
    pca_data = globals().get(f'reduced_components_pca_by_category_{perc}', None)
    
    if pca_data is not None:
        # Initialize the dictionary for this percentage
        pca_results[perc] = {}
        
        for category in categories:
            if category in pca_data:
                pca = pca_data[category]
                components = pca.components_
                explained_variance_ratio = pca.explained_variance_ratio_
                
                # Store the components, the explained variance ratio and the PCA object itself
                pca_results[perc][category] = {
                    'components': components,
                    'explained_variance_ratio': explained_variance_ratio,
                    'pca_object': pca  # Keep the fitted PCA object as well
                }
            else:
                print(f"Category '{category}' is not present in the data for {perc}%.")
    else:
        print(f"PCA data not found for {perc}%.")


In [None]:
def plot_cumulative_variance(explained_variance_ratio):
    """
    Plot the cumulative explained variance from the explained variance ratio of each principal component.
    
    Parameters:
    - explained_variance_ratio: Array or list with the explained variance ratio of each component.
    """
    cumulative_variance = np.cumsum(explained_variance_ratio)  # Cumulative explained variance

    plt.figure(figsize=(8, 5))
    plt.plot(np.arange(1, len(cumulative_variance) + 1), cumulative_variance, marker='o', linestyle='--', color='b')
    plt.title('Cumulative Explained Variance')
    plt.xlabel('Number of Components')
    plt.ylabel('Cumulative Explained Variance')
    plt.grid(True)
    plt.show()

In [None]:
# Check that pca_results holds the explained variance for the chosen setting
percentage_to_use = 95  # Explained-variance percentage to use
category_to_use = 'Coast'  # Category to analyze

if percentage_to_use in pca_results:
    if category_to_use in pca_results[percentage_to_use]:
        explained_variance_ratio = pca_results[percentage_to_use][category_to_use].get('explained_variance_ratio', None)
        
        if explained_variance_ratio is not None:
            plot_cumulative_variance(explained_variance_ratio)
        else:
            print(f"Explained variance ratio not found for {category_to_use} at {percentage_to_use}%.")
    else:
        print(f"Category '{category_to_use}' not found for {percentage_to_use}%.")
else:
    print(f"Percentage '{percentage_to_use}%' not found in pca_results.")


In [None]:
# Check that pca_results holds the explained variance for the chosen setting
percentage_to_use = 95  # Explained-variance percentage to use
category_to_use = 'Office'  # Category to analyze

if percentage_to_use in pca_results:
    if category_to_use in pca_results[percentage_to_use]:
        explained_variance_ratio = pca_results[percentage_to_use][category_to_use].get('explained_variance_ratio', None)
        
        if explained_variance_ratio is not None:
            plot_cumulative_variance(explained_variance_ratio)
        else:
            print(f"Explained variance ratio not found for {category_to_use} at {percentage_to_use}%.")
    else:
        print(f"Category '{category_to_use}' not found for {percentage_to_use}%.")
else:
    print(f"Percentage '{percentage_to_use}%' not found in pca_results.")


- Number of components:

At the 95% variance level, the "Coast" category retained 23 components, while the "Office" category retained 22. The one-component gap is small; it suggests that the remaining variance in the "Coast" data is slightly more spread out across the later components than in "Office".

- Variance explained:

In both categories the first principal component captures most of the variance, but noticeably more in "Coast" (80.18%) than in "Office" (70.35%). This indicates that the dominant pattern in "Coast" images is stronger than the dominant pattern in "Office" images.
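
The component counts above follow from where the cumulative explained variance first reaches the chosen threshold. A minimal sketch of that rule, assuming a hypothetical helper `components_for_variance` and made-up ratios (not the notebook's values):

```python
import numpy as np

def components_for_variance(explained_variance_ratio, threshold=0.95):
    """Smallest number of leading components whose cumulative ratio reaches `threshold`."""
    cumulative = np.cumsum(explained_variance_ratio)
    # searchsorted returns the first index where cumulative >= threshold
    count = int(np.searchsorted(cumulative, threshold)) + 1
    # If the threshold is never reached, fall back to all components
    return min(count, len(cumulative))

# Hypothetical ratios for illustration only
ratios = np.array([0.80, 0.08, 0.05, 0.03, 0.02, 0.02])
n_95 = components_for_variance(ratios, 0.95)
n_90 = components_for_variance(ratios, 0.90)
```

A dominant first component does not by itself imply fewer retained components: the count depends on how quickly the remaining ratios accumulate up to the threshold.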

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Project patches onto the principal components of a category
def project_test_patches(patches, pca_components):
    return np.dot(patches, pca_components.T)

# Compute, per component, the norm, the mean and the mean of the norm of the inner product of the projections
def calculate_norms_and_means(projections_A, projections_B):
    norms = []
    means = []
    means_norms = []
    
    for i in range(projections_A.shape[1]):  # Iterate over the components
        # Both operands are 1-D, so the result is a single scalar per component
        dot_products = np.dot(projections_A[:, i], projections_B[:, i].T)
        norms.append(np.linalg.norm(dot_products))
        means.append(np.mean(dot_products))
        means_norms.append(np.mean(np.linalg.norm(dot_products)))
        
    return norms, means, means_norms

# Plot the mean norms aggregated over all images
def plot_mean_norms_for_all_images(category, other_category, mean_of_means_norms, color='blue'):
    plt.figure(figsize=(10, 6))
    
    # Plot the per-component values averaged over all images
    plt.bar(range(len(mean_of_means_norms)), mean_of_means_norms, color=color,
            label=f'{category} on {other_category} - All Images')
    
    plt.title(f'Mean of Norms for Components ({category} on {other_category}) - 95% Variance Explained')
    plt.xlabel('Component Index')
    plt.ylabel('Mean of Norms')
    plt.legend()
    plt.show()

# Categories to iterate over
categories = ['Coast', 'Office']

# Work only with the PCA fitted at 95% explained variance
perc = 95

# Iterate over category pairs to cover both intra-category and cross-category cases
for category in categories:
    for other_category in categories:
        # Load the components of the same category or of the other category
        components = pca_results[perc][other_category]['components']
        
        if category == other_category:
            print(f"\nCategory: {category} on {category} (Intra-Category), Percentage: {perc}%")
            color = 'blue'
        else:
            print(f"\nCategory: {category} on {other_category} (Cross-Category), Percentage: {perc}%")
            color = 'green'
        
        # Mean norms collected over all images
        all_means_norms = []
        
        # Project the patches (intra- or cross-category) for every image
        for image_id, (patches, positions) in centered_test_patches_by_category[category].items():
            # Project the patches onto the components
            projected_patches = project_test_patches(patches, components)
            
            # Compute norms, means and mean norms for this image
            _, _, means_norms_category = calculate_norms_and_means(projected_patches, projected_patches)
            
            # Store the values computed for this image
            all_means_norms.append(means_norms_category)
        
        # Average the per-component values over all images
        mean_of_means_norms = np.mean(all_means_norms, axis=0)
        
        # Plot the averaged values for every component
        plot_mean_norms_for_all_images(category, other_category, mean_of_means_norms, color=color)


In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Project patches onto the principal components of a category
def project_test_patches(patches, pca_components):
    return np.dot(patches, pca_components.T)

# Compute, per component, the norm, the mean and the mean of the norm of the inner product of the projections
def calculate_norms_and_means(projections_A, projections_B):
    norms = []
    means = []
    means_norms = []
    
    for i in range(projections_A.shape[1]):  # Iterate over the components
        # Both operands are 1-D, so the result is a single scalar per component
        dot_products = np.dot(projections_A[:, i], projections_B[:, i].T)
        norms.append(np.linalg.norm(dot_products))
        means.append(np.mean(dot_products))
        means_norms.append(np.mean(np.linalg.norm(dot_products)))
        
    return norms, means, means_norms

# Select the components that cover up to ~90% of the variance, drop those with very large norms,
# and return the remaining indices sorted by increasing norm
def capture_components_by_variance_and_norm(explained_variance_ratio, norms, variance_threshold=0.9, norm_threshold=1e7):
    # Cumulative explained variance
    cumulative_variance = np.cumsum(explained_variance_ratio)
    
    # Indices whose cumulative variance stays within ~90%
    selected_indices = np.where(cumulative_variance <= variance_threshold)[0]
    
    # Drop the components with very large norms
    selected_indices = [i for i in selected_indices if norms[i] < norm_threshold]
    
    # Sort the indices by norm
    selected_indices = sorted(selected_indices, key=lambda idx: norms[idx])
    
    return selected_indices

# Categories to iterate over
categories = ['Coast', 'Office']

# Work only with the PCA fitted at 95% explained variance
perc = 95

selected_indices_dict = {}

# Iterate over category pairs to cover both intra-category and cross-category cases
for category in categories:
    for other_category in categories:
        # Load the components and explained variance of the same category or of the other category
        components = pca_results[perc][other_category]['components']
        explained_variance_ratio = pca_results[perc][other_category]['explained_variance_ratio']
        
        if category == other_category:
            print(f"\nCategory: {category} on {category} (Intra-Category), Percentage: {perc}%")
        else:
            print(f"\nCategory: {category} on {other_category} (Cross-Category), Percentage: {perc}%")
        
        # Per-image norms and selected components (intra- or cross-category projection)
        all_norms = []
        all_selected_indices = []
        
        for image_id, (patches, positions) in centered_test_patches_by_category[category].items():
            # Project the patches onto the components
            projected_patches = project_test_patches(patches, components)
            
            # Compute norms, means and mean norms for this image
            norms_category, means_category, means_norms_category = calculate_norms_and_means(projected_patches, projected_patches)
            
            # Keep the components covering up to ~90% of the variance with the smallest norms, dropping very large norms
            selected_indices = capture_components_by_variance_and_norm(explained_variance_ratio, norms_category)
            
            # Store the norms and the selected components
            all_norms.append(norms_category)
            all_selected_indices.append(selected_indices)
        
        # Check that at least one component was selected
        if len(all_selected_indices) == 0 or np.concatenate(all_selected_indices).size == 0:
            print(f"Warning: No components selected for {category} on {other_category}. Skipping this combination.")
            continue

        # Aggregate the selected components over all images
        # (cast to int: concatenating with an empty selection yields a float array, which cannot be used as an index)
        aggregated_selected_indices = np.unique(np.concatenate(all_selected_indices)).astype(int)
        
        # Create the nested dictionary if the key does not exist yet
        if category not in selected_indices_dict:
            selected_indices_dict[category] = {}
        
        selected_indices_dict[category][other_category] = aggregated_selected_indices

        # Skip plotting if no component was selected
        if len(aggregated_selected_indices) == 0:
            print(f"Warning: No valid components selected for {category} on {other_category}. Skipping plot.")
            continue

        # Plot the results for the selected components
        plt.figure(figsize=(10, 6))
        plt.bar(aggregated_selected_indices, [np.mean([norms[i] for norms in all_norms if i < len(norms)]) for i in aggregated_selected_indices], 
                color='green' if category != other_category else 'blue',
                label=f'{category} on {other_category} - Selected Components')
        plt.title(f'Selected Components Based on ~90% Variance and Smallest Norms ({category} on {other_category}) - 95% Variance Explained')
        plt.xlabel('Component Index')
        plt.ylabel('Mean of Norms')
        plt.legend()
        plt.show()


In [None]:
centered_test_office_patches = centered_test_patches_by_category['Office']
centered_test_coast_patches = centered_test_patches_by_category['Coast']

In [None]:
import os
import numpy as np

# Parameters
patch_size = (32, 32)

def project_and_transform_back(data, pca, specific_indices):
    """
    Project the data onto the selected principal components and reconstruct it from those components.
    """
    # Project the patches onto the principal components
    projected = pca.transform(data)
    
    # Keep only the selected components
    projected_specific = projected[:, specific_indices]
    
    # Reconstruct the patches from the selected components only
    specific_components = pca.components_[specific_indices]
    reconstructed_patches = np.dot(projected_specific, specific_components)
    
    return reconstructed_patches

def calculate_mean_ood_for_specific_components(original_patches_by_image, pca, specific_indices):
    """
    Project the original patches onto selected PCA components, reconstruct them, and return the mean of the OOD scores.
    """
    total_ood_scores = []
    
    # Iterate over all images
    for image_id, (patches, _) in original_patches_by_image.items():
        # Project and reconstruct the patches with the selected components
        reconstructed_patches = project_and_transform_back(patches, pca, specific_indices)
        
        # Residuals (reconstruction error)
        residuals = patches - reconstructed_patches
        
        # OOD score: norm of the residual over the norm of the original patch
        original_norms = np.linalg.norm(patches, axis=1)
        residual_norms = np.linalg.norm(residuals, axis=1)
        
        # OOD score for every patch
        ood_scores = residual_norms / original_norms
        
        # Append this image's OOD scores to the overall list
        total_ood_scores.extend(ood_scores)
    
    # Return the mean of the OOD scores
    return np.mean(total_ood_scores)

# Iterate over category pairs to compute the mean OOD scores
mean_ood_scores = {}

for category in categories:
    for other_category in categories:
        specific_indices = selected_indices_dict[category][other_category]
        
        # Fetch the PCA object of the corresponding category
        pca_object = pca_results[perc][other_category]['pca_object']  # The components come from other_category
        
        # Check that test patches exist for the category
        if category not in centered_test_patches_by_category:
            print(f"Warning: No test patches found for {category}. Skipping.")
            continue
        
        
        # Mean OOD score based on the projection onto the selected components
        mean_ood = calculate_mean_ood_for_specific_components(centered_test_patches_by_category[category], pca_object, specific_indices)
        
        # Store the mean in the dictionary
        mean_ood_scores[f"{category}_on_{other_category}"] = mean_ood

# Print all computed means
for key, mean_ood in mean_ood_scores.items():
    print(f"Mean OOD Score for {key}: {mean_ood}")


## ✌️ Part II: Comparing two similar environments
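
Before running the pipeline on Bedroom and LivingRoom, it may help to see why similar categories are expected to be harder to separate with a residual-based score. The sketch below is a toy model on synthetic data, not the notebook's pipeline: the subspaces, the angle `theta`, and the helpers `sample` and `ood_score` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n_features, k, n_samples = 20, 3, 300

def sample(basis, noise=0.01):
    """Draw points near the span of `basis` (rows), plus small isotropic noise."""
    coeffs = rng.normal(size=(n_samples, basis.shape[0]))
    return coeffs @ basis + noise * rng.normal(size=(n_samples, n_features))

eye = np.eye(n_features)
theta = 0.3  # hypothetical angle between the "similar" subspaces
basis_train = eye[:k]
basis_similar = np.cos(theta) * eye[:k] + np.sin(theta) * eye[5:5 + k]
basis_different = eye[10:10 + k]

# Fit: top-k right singular vectors of the centered training data
train = sample(basis_train)
mean = train.mean(axis=0)
_, _, Vt = np.linalg.svd(train - mean, full_matrices=False)
components = Vt[:k]

def ood_score(data):
    """Mean relative reconstruction error under the training subspace."""
    centered = data - mean
    residual = centered - (centered @ components.T) @ components
    return float(np.mean(np.linalg.norm(residual, axis=1) / np.linalg.norm(centered, axis=1)))

score_same = ood_score(sample(basis_train))
score_similar = ood_score(sample(basis_similar))
score_different = ood_score(sample(basis_different))
```

In this toy setting the score for the similar subspace sits between the in-distribution score and the score for the unrelated subspace, so the margin available for a decision threshold is narrower.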

In [None]:
train_categories = ['Bedroom', 'LivingRoom']

df_similar = df[df['category'].isin(train_categories)]
df_similar

In [None]:
X = df_similar['image_path']
y = df_similar['category']
(X_train, X_test, y_train, y_test) = train_test_split(X, y, test_size=0.2, random_state=10)


In [None]:
create_images_set(X_train, X_test, y_train, y_test, patch_size, output_dir_train='patches_train', output_dir_test='patches_test')

In [None]:
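# Pass the list of category names (as in the later call) rather than the full label Series
training_patches_by_category = load_patches_by_category('patches_train', train_categories)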

In [None]:
centered_training_patches_by_category = {}
for category, images in training_patches_by_category.items():
    centered_images = {}
    for image_id, (patches, positions) in images.items():
        centered_patches = center_patches(np.array(patches))
        centered_images[image_id] = (centered_patches, positions)
    centered_training_patches_by_category[category] = centered_images

print(centered_training_patches_by_category['Bedroom'][list(centered_training_patches_by_category['Bedroom'].keys())[0]][0].shape)
print(centered_training_patches_by_category['LivingRoom'][list(centered_training_patches_by_category['LivingRoom'].keys())[0]][0].shape)

In [None]:
categories = train_categories
base_input_dir = "patches_train"
base_output_dir = "similar_original_images_raw"

patch_size = (32, 32)
original_image_shape = (256, 256)

training_patches_by_category = load_patches_by_category(base_input_dir, categories)

for category, image_patches in training_patches_by_category.items():
    for image_id, (patches, positions) in image_patches.items():
        centered_patches = center_patches(np.array(patches))
        reconstructed_image = reassemble_image_from_patches(centered_patches, positions, original_image_shape, patch_size)
        output_dir = os.path.join(base_output_dir, category)
        os.makedirs(output_dir, exist_ok=True)
        output_path = os.path.join(output_dir, f"reconstructed_image_{image_id}.png")
        save_reconstructed_image(reconstructed_image, output_path)

In [None]:
for category, images in training_patches_by_category.items():
    count = 0
    for image_id, (patches, positions) in images.items():
        if count >= 3:
            break
        plot_distribution(patches, f'Original data distribution - {category} (Image ID: {image_id})')
        centered_patches = centered_training_patches_by_category[category][image_id][0]
        plot_distribution(centered_patches, f'Centered data distribution - {category} (Image ID: {image_id})')
        count += 1


In [None]:
all_components_pca_by_category = apply_all_components_pca(centered_training_patches_by_category)
num_components_all_dict = {category: 1024 for category in centered_training_patches_by_category}
visualize_pca_components(all_components_pca_by_category, num_components_all_dict)

In [None]:
reduced_components_pca_by_category_95, num_components_reduced_dict_95, _ = apply_reduced_pca(centered_training_patches_by_category, number_variance=0.95)
reduced_components_pca_by_category_90, num_components_reduced_dict_90, _ = apply_reduced_pca(centered_training_patches_by_category, number_variance=0.90)
reduced_components_pca_by_category_85, num_components_reduced_dict_85, _ = apply_reduced_pca(centered_training_patches_by_category, number_variance=0.85)
reduced_components_pca_by_category_80, num_components_reduced_dict_80, _ = apply_reduced_pca(centered_training_patches_by_category, number_variance=0.80)

In [None]:
def print_num_components(num_components_reduced_dict, variance_level):
    print(f"\nNumber of Components for {variance_level*100}% Variance Explained:")
    for category, num_components in num_components_reduced_dict.items():
        print(f"Category: {category}, Number of Components: {num_components}")

# Print the number of components for each explained-variance level
print_num_components(num_components_reduced_dict_95, 0.95)
print_num_components(num_components_reduced_dict_90, 0.90)
print_num_components(num_components_reduced_dict_85, 0.85)
print_num_components(num_components_reduced_dict_80, 0.80)

In [None]:
all_components_pca_by_category = apply_all_components_pca(centered_training_patches_by_category)

reconstructed_patches_by_category = project_and_reconstruct_patches(all_components_pca_by_category, centered_training_patches_by_category)

In [None]:
base_output_dir = "similar_all_components_pca_images_reconstructed"
patch_size = (32, 32)
original_image_shape = (256, 256)

for category, image_patches in reconstructed_patches_by_category.items():
    for image_id, (patches, positions) in image_patches.items():
        reconstructed_image = reassemble_image_from_patches(patches, positions, original_image_shape, patch_size)
        
        output_dir = os.path.join(base_output_dir, category)
        os.makedirs(output_dir, exist_ok=True)
        output_path = os.path.join(output_dir, f"reconstructed_image_{image_id}.png")
        save_reconstructed_image(reconstructed_image, output_path)


In [None]:
reduced_reconstructed_patches_by_category = project_and_reconstruct_patches(reduced_components_pca_by_category_90, centered_training_patches_by_category)

In [None]:
base_output_dir = "similar_reduced_components_pca_images_reconstructed"
patch_size = (32, 32)
original_image_shape = (256, 256)

for category, image_patches in reduced_reconstructed_patches_by_category.items():
    for image_id, (patches, positions) in image_patches.items():
        reconstructed_image = reassemble_image_from_patches(patches, positions, original_image_shape, patch_size)
        
        output_dir = os.path.join(base_output_dir, category)
        os.makedirs(output_dir, exist_ok=True)
        output_path = os.path.join(output_dir, f"reconstructed_image_{image_id}.png")
        save_reconstructed_image(reconstructed_image, output_path)


### Test

In [None]:
# Pass the list of category names rather than the full label Series
test_patches_by_category = load_patches_by_category('patches_test', train_categories)

In [None]:
centered_test_patches_by_category = {}

for category, images in test_patches_by_category.items():
    centered_images = {}
    for image_id, (patches, positions) in images.items():
        centered_patches = center_patches(np.array(patches))
        centered_images[image_id] = (centered_patches, positions)
    
    centered_test_patches_by_category[category] = centered_images

print(centered_test_patches_by_category['Bedroom'][list(centered_test_patches_by_category['Bedroom'].keys())[0]][0].shape)
print(centered_test_patches_by_category['LivingRoom'][list(centered_test_patches_by_category['LivingRoom'].keys())[0]][0].shape)


In [None]:
for category, images in test_patches_by_category.items():
    count = 0
    for image_id, (patches, positions) in images.items():
        if count >= 3:
            break
        plot_distribution(patches, f'Original data distribution - {category} (Image ID: {image_id})')
        centered_patches = centered_test_patches_by_category[category][image_id][0]
        plot_distribution(centered_patches, f'Centered data distribution - {category} (Image ID: {image_id})')
        count += 1

In [None]:
for category, pca in reduced_components_pca_by_category_90.items():
    print(f"Category: {category}, Number of components: {pca.n_components_}")
    print(f"Explained variance by components: {pca.explained_variance_ratio_}")

In [None]:
residuals_output_dir = "similar_reduced_components_pca_images_residuals"

In [None]:
all_components_residuals_by_category = calculate_residuals_with_pca(all_components_pca_by_category, centered_test_patches_by_category)

reduced_components_residuals_by_category_95 = calculate_residuals_with_pca(reduced_components_pca_by_category_95, centered_test_patches_by_category)
reduced_components_residuals_by_category_90 = calculate_residuals_with_pca(reduced_components_pca_by_category_90, centered_test_patches_by_category)
reduced_components_residuals_by_category_85 = calculate_residuals_with_pca(reduced_components_pca_by_category_85, centered_test_patches_by_category)
reduced_components_residuals_by_category_80 = calculate_residuals_with_pca(reduced_components_pca_by_category_80, centered_test_patches_by_category)

save_residual_images_as_full_image(reduced_components_residuals_by_category_90, residuals_output_dir)

In [None]:
Bedroom_residuals = list(all_components_residuals_by_category['Bedroom'].values())
LivingRoom_residuals = list(all_components_residuals_by_category['LivingRoom'].values())

Bedroom_original_patches = list(centered_test_patches_by_category['Bedroom'].values())
LivingRoom_original_patches = list(centered_test_patches_by_category['LivingRoom'].values())

ood_score_Bedroom_train_Bedroom_test = process_and_calculate_ood(Bedroom_residuals, Bedroom_original_patches)
ood_score_Bedroom_train_LivingRoom_test = process_and_calculate_ood(Bedroom_residuals, LivingRoom_original_patches)
ood_score_LivingRoom_train_LivingRoom_test = process_and_calculate_ood(LivingRoom_residuals, LivingRoom_original_patches)
ood_score_LivingRoom_train_Bedroom_test = process_and_calculate_ood(LivingRoom_residuals, Bedroom_original_patches)

print("OOD Scores using Bedroom Training Data on Bedroom Testing Data:")
print(f"OOD Score: {ood_score_Bedroom_train_Bedroom_test}")

print("\nOOD Scores using Bedroom Training Data on LivingRoom Testing Data:")
print(f"OOD Score: {ood_score_Bedroom_train_LivingRoom_test}")

print("\nOOD Scores using LivingRoom Training Data on LivingRoom Testing Data:")
print(f"OOD Score: {ood_score_LivingRoom_train_LivingRoom_test}")

print("\nOOD Scores using LivingRoom Training Data on Bedroom Testing Data:")
print(f"OOD Score: {ood_score_LivingRoom_train_Bedroom_test}")

In [None]:
ood_scores_all_levels = {
    0.95: {
        'Bedroom_train_Bedroom_test': process_and_calculate_ood(list(reduced_components_residuals_by_category_95['Bedroom'].values()), list(centered_test_patches_by_category['Bedroom'].values())),
        'Bedroom_train_LivingRoom_test': process_and_calculate_ood(list(reduced_components_residuals_by_category_95['Bedroom'].values()), list(centered_test_patches_by_category['LivingRoom'].values())),
        'LivingRoom_train_LivingRoom_test': process_and_calculate_ood(list(reduced_components_residuals_by_category_95['LivingRoom'].values()), list(centered_test_patches_by_category['LivingRoom'].values())),
        'LivingRoom_train_Bedroom_test': process_and_calculate_ood(list(reduced_components_residuals_by_category_95['LivingRoom'].values()), list(centered_test_patches_by_category['Bedroom'].values()))
    },
    0.9: {
        'Bedroom_train_Bedroom_test': process_and_calculate_ood(list(reduced_components_residuals_by_category_90['Bedroom'].values()), list(centered_test_patches_by_category['Bedroom'].values())),
        'Bedroom_train_LivingRoom_test': process_and_calculate_ood(list(reduced_components_residuals_by_category_90['Bedroom'].values()), list(centered_test_patches_by_category['LivingRoom'].values())),
        'LivingRoom_train_LivingRoom_test': process_and_calculate_ood(list(reduced_components_residuals_by_category_90['LivingRoom'].values()), list(centered_test_patches_by_category['LivingRoom'].values())),
        'LivingRoom_train_Bedroom_test': process_and_calculate_ood(list(reduced_components_residuals_by_category_90['LivingRoom'].values()), list(centered_test_patches_by_category['Bedroom'].values()))
    },
    0.85: {
        'Bedroom_train_Bedroom_test': process_and_calculate_ood(list(reduced_components_residuals_by_category_85['Bedroom'].values()), list(centered_test_patches_by_category['Bedroom'].values())),
        'Bedroom_train_LivingRoom_test': process_and_calculate_ood(list(reduced_components_residuals_by_category_85['Bedroom'].values()), list(centered_test_patches_by_category['LivingRoom'].values())),
        'LivingRoom_train_LivingRoom_test': process_and_calculate_ood(list(reduced_components_residuals_by_category_85['LivingRoom'].values()), list(centered_test_patches_by_category['LivingRoom'].values())),
        'LivingRoom_train_Bedroom_test': process_and_calculate_ood(list(reduced_components_residuals_by_category_85['LivingRoom'].values()), list(centered_test_patches_by_category['Bedroom'].values()))
    },
    0.8: {
        'Bedroom_train_Bedroom_test': process_and_calculate_ood(list(reduced_components_residuals_by_category_80['Bedroom'].values()), list(centered_test_patches_by_category['Bedroom'].values())),
        'Bedroom_train_LivingRoom_test': process_and_calculate_ood(list(reduced_components_residuals_by_category_80['Bedroom'].values()), list(centered_test_patches_by_category['LivingRoom'].values())),
        'LivingRoom_train_LivingRoom_test': process_and_calculate_ood(list(reduced_components_residuals_by_category_80['LivingRoom'].values()), list(centered_test_patches_by_category['LivingRoom'].values())),
        'LivingRoom_train_Bedroom_test': process_and_calculate_ood(list(reduced_components_residuals_by_category_80['LivingRoom'].values()), list(centered_test_patches_by_category['Bedroom'].values()))
    }
}

for variance, scores in ood_scores_all_levels.items():
    print(f"\nVariance level: {variance}")
    for test_type, score in scores.items():
        print(f"{test_type}: OOD Score: {score}")

## Agnostic Spaces

In [None]:
# Initialize the dictionary that stores the PCA results
pca_results = {}

# Explained-variance percentages to collect
percentages = [95, 90, 85, 80]

# Assumes the variables 'reduced_components_pca_by_category_X' exist (X being the percentage)
# and hold the fitted PCA objects for each category
for perc in percentages:
    pca_data = globals().get(f'reduced_components_pca_by_category_{perc}', None)
    
    if pca_data is not None:
        # Initialize the dictionary for this percentage
        pca_results[perc] = {}
        
        for category in categories:
            if category in pca_data:
                pca = pca_data[category]
                components = pca.components_
                explained_variance_ratio = pca.explained_variance_ratio_
                
                # Store the components, the explained variance ratio and the PCA object itself
                pca_results[perc][category] = {
                    'components': components,
                    'explained_variance_ratio': explained_variance_ratio,
                    'pca_object': pca  # Keep the fitted PCA object as well
                }
            else:
                print(f"Category '{category}' is not present in the data for {perc}%.")
    else:
        print(f"PCA data not found for {perc}%.")


In [None]:
def plot_cumulative_variance(explained_variance_ratio):
    """
    Plot the cumulative explained variance from the explained variance of each principal component.
    
    Parameters:
    - explained_variance_ratio: Array or list with the explained variance ratio per component.
    """
    cumulative_variance = np.cumsum(explained_variance_ratio)  # Running total of explained variance

    plt.figure(figsize=(8, 5))
    plt.plot(np.arange(1, len(cumulative_variance) + 1), cumulative_variance, marker='o', linestyle='--', color='b')
    plt.title('Cumulative Explained Variance')
    plt.xlabel('Number of Components')
    plt.ylabel('Cumulative Explained Variance')
    plt.grid(True)
    plt.show()
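
The curve plotted above can also be read numerically. The sketch below finds the smallest number of leading components whose cumulative explained variance reaches a threshold. The helper name `components_for_variance` and the toy ratios are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def components_for_variance(explained_variance_ratio, threshold=0.95):
    """Smallest number of leading components whose cumulative variance reaches `threshold`."""
    cumulative = np.cumsum(explained_variance_ratio)
    # searchsorted returns the first index where cumulative >= threshold
    count = int(np.searchsorted(cumulative, threshold)) + 1
    # If the threshold is never reached, fall back to all available components
    return min(count, len(cumulative))

# Toy ratios (hypothetical): cumulative sums are 0.5, 0.75, 0.875, 0.9375, 1.0
ratios = np.array([0.5, 0.25, 0.125, 0.0625, 0.0625])
n_95 = components_for_variance(ratios, 0.95)
n_80 = components_for_variance(ratios, 0.80)
print(n_95, n_80)
```

If a PCA object was already fitted with a variance threshold, its `explained_variance_ratio_` only covers the kept components, so the curve stops near that threshold instead of reaching 1.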

In [None]:
# Check that pca_results holds the explained variance for the chosen case
percentage_to_use = 95  # Variance percentage to use
category_to_use = 'LivingRoom'  # Category to analyze

if percentage_to_use in pca_results:
    if category_to_use in pca_results[percentage_to_use]:
        explained_variance_ratio = pca_results[percentage_to_use][category_to_use].get('explained_variance_ratio', None)
        
        if explained_variance_ratio is not None:
            plot_cumulative_variance(explained_variance_ratio)
        else:
            print(f"Explained variance ratio not found for {category_to_use} at {percentage_to_use}%.")
    else:
        print(f"Category '{category_to_use}' not found for {percentage_to_use}%.")
else:
    print(f"Percentage '{percentage_to_use}%' not found in pca_results.")


In [None]:
# Check that pca_results holds the explained variance for the chosen case
percentage_to_use = 95  # Variance percentage to use
category_to_use = 'Bedroom'  # Category to analyze

if percentage_to_use in pca_results:
    if category_to_use in pca_results[percentage_to_use]:
        explained_variance_ratio = pca_results[percentage_to_use][category_to_use].get('explained_variance_ratio', None)
        
        if explained_variance_ratio is not None:
            plot_cumulative_variance(explained_variance_ratio)
        else:
            print(f"Explained variance ratio not found for {category_to_use} at {percentage_to_use}%.")
    else:
        print(f"Category '{category_to_use}' not found for {percentage_to_use}%.")
else:
    print(f"Percentage '{percentage_to_use}%' not found in pca_results.")


In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Project patches onto the components of a category
def project_test_patches(patches, pca_components):
    return np.dot(patches, pca_components.T)

# Compute norms, means, and means of norms of the per-component inner products
def calculate_norms_and_means(projections_A, projections_B):
    norms = []
    means = []
    means_norms = []
    
    for i in range(projections_A.shape[1]):  # Iterate over the components
        # The inner product of two 1-D columns is a scalar, so the norm below is its
        # absolute value and the mean is the value itself
        dot_products = np.dot(projections_A[:, i], projections_B[:, i].T)
        norms.append(np.linalg.norm(dot_products))
        means.append(np.mean(dot_products))
        means_norms.append(np.mean(np.linalg.norm(dot_products)))
        
    return norms, means, means_norms

# Plot the mean norms aggregated over all images
def plot_mean_norms_for_all_images(category, other_category, mean_of_means_norms, color='blue'):
    plt.figure(figsize=(10, 6))
    
    # Plot the per-component results combined over all images
    plt.bar(range(len(mean_of_means_norms)), mean_of_means_norms, color=color,
            label=f'{category} on {other_category} - All Images')
    
    plt.title(f'Mean of Norms for Components ({category} on {other_category}) - 95% Variance Explained')
    plt.xlabel('Component Index')
    plt.ylabel('Mean of Norms')
    plt.legend()
    plt.show()

# Categories to iterate over
categories = ['LivingRoom', 'Bedroom']

# Work only with the PCA fitted for 95% explained variance
perc = 95

# Iterate over the categories to compute both intra-category and cross-category results
for category in categories:
    for other_category in categories:
        # Load the components of the same category or of the other category
        components = pca_results[perc][other_category]['components']
        
        if category == other_category:
            print(f"\nCategory: {category} on {category} (Intra-Category), Percentage: {perc}%")
            color = 'blue'
        else:
            print(f"\nCategory: {category} on {other_category} (Cross-Category), Percentage: {perc}%")
            color = 'green'
        
        # Collect the mean norms for all images
        all_means_norms = []
        
        # Project the patches (intra- or cross-category) for every image
        for image_id, (patches, positions) in centered_test_patches_by_category[category].items():
            # Project the patches onto the components
            projected_patches = project_test_patches(patches, components)
            
            # Compute norms, means, and means of norms for each image
            _, _, means_norms_category = calculate_norms_and_means(projected_patches, projected_patches)
            
            # Store the values computed for this image
            all_means_norms.append(means_norms_category)
        
        # Average the norms over all images
        mean_of_means_norms = np.mean(all_means_norms, axis=0)  # Per-component mean across images
        
        # Plot the mean norm of every component
        plot_mean_norms_for_all_images(category, other_category, mean_of_means_norms, color=color)
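
Because `calculate_norms_and_means` is called with the same projections for both arguments, the value it stores for each component is the sum of squared projections of all patches onto that component. A minimal sketch on random toy data (an assumption, not notebook data) shows that the loop and a vectorized sum of squares agree:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy projections: 64 patches (an 8x8 grid) onto 5 components
projections = rng.normal(size=(64, 5))

# Loop form, mirroring the per-component inner product used in the cell above
loop_values = [
    np.linalg.norm(np.dot(projections[:, i], projections[:, i]))
    for i in range(projections.shape[1])
]

# Vectorized form: sum of squared projections per component
energy = np.sum(projections ** 2, axis=0)

print(np.allclose(loop_values, energy))
```

The bars in the plots can therefore be read as the average energy that the test images place on each principal direction.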


In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Project patches onto the components of a category
def project_test_patches(patches, pca_components):
    return np.dot(patches, pca_components.T)

# Compute norms, means, and means of norms of the per-component inner products
def calculate_norms_and_means(projections_A, projections_B):
    norms = []
    means = []
    means_norms = []
    
    for i in range(projections_A.shape[1]):  # Iterate over the components
        dot_products = np.dot(projections_A[:, i], projections_B[:, i].T)
        norms.append(np.linalg.norm(dot_products))
        means.append(np.mean(dot_products))
        means_norms.append(np.mean(np.linalg.norm(dot_products)))
        
    return norms, means, means_norms

# Select the components that explain ~90% of the variance, dropping very large norms and sorting by norm
def capture_components_by_variance_and_norm(explained_variance_ratio, norms, variance_threshold=0.9, norm_threshold=1e7):
    # Compute the cumulative explained variance
    cumulative_variance = np.cumsum(explained_variance_ratio)
    
    # Keep the indices whose cumulative variance stays within ~90%
    selected_indices = np.where(cumulative_variance <= variance_threshold)[0]
    
    # Drop the components with very large norms
    selected_indices = [i for i in selected_indices if norms[i] < norm_threshold]
    
    # Sort the indices by norm
    selected_indices = sorted(selected_indices, key=lambda idx: norms[idx])
    
    return selected_indices

# Categories to iterate over
categories = ['Bedroom', 'LivingRoom']

# Work only with the PCA fitted for 95% explained variance
perc = 95

selected_indices_dict = {}

# Iterate over the categories to compute both intra-category and cross-category results
for category in categories:
    for other_category in categories:
        # Load the components and explained variance of the same category or of the other category
        components = pca_results[perc][other_category]['components']
        explained_variance_ratio = pca_results[perc][other_category]['explained_variance_ratio']
        
        if category == other_category:
            print(f"\nCategory: {category} on {category} (Intra-Category), Percentage: {perc}%")
        else:
            print(f"\nCategory: {category} on {other_category} (Cross-Category), Percentage: {perc}%")
        
        # Project the patches (intra- or cross-category)
        all_norms = []
        all_selected_indices = []
        
        # Note: this loop reads the uncentered test_patches_by_category, whereas the
        # previous cell projected centered_test_patches_by_category
        for image_id, (patches, positions) in test_patches_by_category[category].items():
            # Project the patches onto the components
            projected_patches = project_test_patches(patches, components)
            
            # Compute norms, means, and means of norms for each image
            norms_category, means_category, means_norms_category = calculate_norms_and_means(projected_patches, projected_patches)
            
            # Select the components that explain up to ~90% of the variance, excluding very large norms
            selected_indices = capture_components_by_variance_and_norm(explained_variance_ratio, norms_category)
            
            # Store the norms and the selected components
            all_norms.append(norms_category)
            all_selected_indices.append(selected_indices)
        
        # Check whether any component was selected
        if len(all_selected_indices) == 0 or np.concatenate(all_selected_indices).size == 0:
            print(f"Warning: No components selected for {category} on {other_category}. Skipping this combination.")
            continue

        # Aggregate the selected components over all images
        aggregated_selected_indices = np.unique(np.concatenate(all_selected_indices))
        
        # Initialize the nested dictionary if the key does not exist yet
        if category not in selected_indices_dict:
            selected_indices_dict[category] = {}
        
        selected_indices_dict[category][other_category] = aggregated_selected_indices

        # Skip plotting if no components were selected
        if len(aggregated_selected_indices) == 0:
            print(f"Warning: No valid components selected for {category} on {other_category}. Skipping plot.")
            continue

        # Plot the results for the selected components
        plt.figure(figsize=(10, 6))
        plt.bar(aggregated_selected_indices, [np.mean([norms[i] for norms in all_norms if i < len(norms)]) for i in aggregated_selected_indices], 
                color='green' if category != other_category else 'blue',
                label=f'{category} on {other_category} - Selected Components')
        plt.title(f'Selected Components Based on ~90% Variance and Smallest Norms ({category} on {other_category}) - 95% Variance Explained')
        plt.xlabel('Component Index')
        plt.ylabel('Mean of Norms')
        plt.legend()
        plt.show()


In [None]:
centered_test_Bedroom_patches = centered_test_patches_by_category['Bedroom']
centered_test_LivingRoom_patches = centered_test_patches_by_category['LivingRoom']

In [None]:
import os
import numpy as np

# Parameters
patch_size = (32, 32)

def project_and_transform_back(data, pca, specific_indices):
    """
    Project the data onto the given principal components and reconstruct it from those components only.
    """
    # Project the patches onto the principal components
    projected = pca.transform(data)
    
    # Keep only the selected components
    projected_specific = projected[:, specific_indices]
    
    # Reconstruct the patches from the selected components only
    specific_components = pca.components_[specific_indices]
    reconstructed_patches = np.dot(projected_specific, specific_components)
    
    return reconstructed_patches

def calculate_mean_ood_for_specific_components(original_patches_by_image, pca, specific_indices):
    """
    Project the original patches onto specific PCA components, reconstruct them, and return the mean OOD score.
    """
    total_ood_scores = []
    
    # Iterate over all images
    for image_id, (patches, _) in original_patches_by_image.items():
        # Project onto the selected components and reconstruct
        reconstructed_patches = project_and_transform_back(patches, pca, specific_indices)
        
        # Compute the residuals (reconstruction error)
        residuals = patches - reconstructed_patches
        
        # Per-patch norms of the originals and of the residuals
        original_norms = np.linalg.norm(patches, axis=1)
        residual_norms = np.linalg.norm(residuals, axis=1)
        
        # OOD score of every patch: residual norm divided by original norm
        ood_scores = residual_norms / original_norms
        
        # Append this image's OOD scores to the overall list
        total_ood_scores.extend(ood_scores)
    
    # Return the mean of the OOD scores
    return np.mean(total_ood_scores)

# Iterate over the categories to compute the mean OOD scores
mean_ood_scores = {}

for category in categories:
    for other_category in categories:
        specific_indices = selected_indices_dict[category][other_category]
        
        # Retrieve the PCA object of the category that supplies the components
        pca_object = pca_results[perc][other_category]['pca_object']  # Components come from other_category
        
        # Check that test patches exist for the category
        if category not in centered_test_patches_by_category:
            print(f"Warning: No test patches found for {category}. Skipping.")
            continue
        
        # Compute the mean OOD score from the projection onto the selected components
        mean_ood = calculate_mean_ood_for_specific_components(centered_test_patches_by_category[category], pca_object, specific_indices)
        
        # Store the mean in the dictionary
        mean_ood_scores[f"{category}_on_{other_category}"] = mean_ood

# Print all computed means
for key, mean_ood in mean_ood_scores.items():
    print(f"Mean OOD Score for {key}: {mean_ood}")
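
To make the score above concrete, here is a small worked example of the same ratio (residual norm over patch norm) on a hand-made 3-dimensional case. The function name `relative_residual`, the toy basis, and the zero-norm guard are assumptions added for illustration. The notebook code divides directly.

```python
import numpy as np

def relative_residual(patches, components):
    """OOD-style score per patch: ||x - reconstruction|| / ||x||.

    `components` is a (k, d) array with orthonormal rows; `patches` is (n, d).
    """
    coords = patches @ components.T          # coordinates in the subspace
    reconstructed = coords @ components      # back to the original space
    residual_norms = np.linalg.norm(patches - reconstructed, axis=1)
    patch_norms = np.linalg.norm(patches, axis=1)
    # Guard against all-zero patches (an addition; not in the notebook code)
    return np.divide(residual_norms, patch_norms,
                     out=np.zeros_like(residual_norms), where=patch_norms > 0)

# Subspace spanned by the first two axes of a 3-D space
components = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]])
patches = np.array([[3.0, 4.0, 0.0],   # lies inside the subspace
                    [0.0, 0.0, 2.0],   # orthogonal to the subspace
                    [1.0, 0.0, 1.0]])  # half inside, half outside
scores = relative_residual(patches, components)
print(scores)
```

A score near 0 means the patch is well explained by the chosen components, and a score near 1 means almost none of it is.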


# All environments

In [None]:
X = df['image_path']
y = df['category']
(X_train, X_test, y_train, y_test) = train_test_split(X, y, test_size=0.2, random_state=10)

In [None]:
create_images_set(X_train, X_test, y_train, y_test, patch_size, output_dir_train='patches_train', output_dir_test='patches_test')

In [None]:
def load_patches_by_category(base_dir, categories):
    patches_by_category = {}
    
    for category in categories:
        category_patches = {}
        category_dir = os.path.join(base_dir, str(category))
        
        for root, _, files in os.walk(category_dir):
            files = [f for f in files if f.endswith('.png') and '_patch_' in f]
            files = sorted(files, key=lambda x: (int(x.split('_')[1]), int(x.split('_')[3]), int(x.split('_')[4].split('.')[0])))

            for filename in files:
                try:
                    parts = filename.split('_')
                    image_id = int(parts[1])
                    y = int(parts[3])
                    x = int(parts[4].split('.')[0])
                    patch = cv2.imread(os.path.join(root, filename), cv2.IMREAD_GRAYSCALE)
                    if patch is not None:
                        if image_id not in category_patches:
                            category_patches[image_id] = ([], [])
                        category_patches[image_id][0].append(patch.flatten())
                        category_patches[image_id][1].append((y, x))
                except (IndexError, ValueError) as e:
                    print(f"Error processing file {filename}: {e}")
                    continue

        patches_by_category[category] = category_patches
    
    return patches_by_category
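
The loader above relies on a filename convention in which the second underscore-separated field is the image id and the fourth and fifth are the patch's `y` and `x` offsets. A minimal sketch of that parsing with a regular expression follows. The example names such as `image_10_patch_0_32.png` are hypothetical, since the real prefix is produced elsewhere in the notebook.

```python
import re

# Assumed pattern: <prefix>_<image_id>_patch_<y>_<x>.png
PATCH_NAME = re.compile(r"^[^_]+_(\d+)_patch_(\d+)_(\d+)\.png$")

def parse_patch_filename(filename):
    """Return (image_id, y, x) as integers, or None if the name does not match."""
    match = PATCH_NAME.match(filename)
    if match is None:
        return None
    image_id, y, x = (int(group) for group in match.groups())
    return image_id, y, x

names = ["image_10_patch_0_32.png", "image_2_patch_32_0.png",
         "image_2_patch_0_0.png", "notes.txt"]
# Sorting the integer tuples orders by image id, then y, then x
parsed = sorted(p for p in map(parse_patch_filename, names) if p is not None)
print(parsed)
```

Sorting on integers matters here: a plain string sort would place image 10 before image 2.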

In [None]:
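# Pass each category once; iterating over `y` itself would reload a category for every image row
training_patches_by_category = load_patches_by_category('patches_train', y.unique())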

In [None]:
def center_patches(patches):
    return patches - patches.mean(axis=0)

centered_training_patches_by_category = {}
for category, images in training_patches_by_category.items():
    centered_images = {}
    for image_id, (patches, positions) in images.items():
        centered_patches = center_patches(np.array(patches))
        centered_images[image_id] = (centered_patches, positions)
    centered_training_patches_by_category[category] = centered_images
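
Note what `center_patches` actually subtracts: with `axis=0` it removes the per-pixel mean taken across the patches of one image, not each patch's own mean. A tiny example with made-up numbers:

```python
import numpy as np

def center_patches(patches):
    # Same definition as in the cell above
    return patches - patches.mean(axis=0)

# Three toy "patches" with two pixels each
patches = np.array([[10.0, 20.0],
                    [30.0, 40.0],
                    [50.0, 60.0]])
centered = center_patches(patches)

print(centered.mean(axis=0))  # per-pixel means across patches
print(centered.mean(axis=1))  # per-patch means
```

Each pixel position now averages to zero over the image's patches, while individual patches can still have a non-zero mean. Since the centering is done per image, every image, train or test, is centered with its own statistics.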

In [None]:
for category, images in centered_training_patches_by_category.items():
    sorted_ids = sorted(images.keys())
    print(f"Sorted Image IDs for category {category}: {sorted_ids}")

In [None]:
def reassemble_image_from_patches(patches, positions, original_image_shape, patch_size):
    reconstructed_image = np.zeros(original_image_shape, dtype=np.float32)
    patch_height, patch_width = patch_size

    for patch, (i, j) in zip(patches, positions):
        if i + patch_height <= original_image_shape[0] and j + patch_width <= original_image_shape[1]:
            patch = patch.reshape((patch_height, patch_width))
            reconstructed_image[i:i + patch_height, j:j + patch_width] = patch

    return reconstructed_image

def save_reconstructed_image(reconstructed_image, save_path):
    reconstructed_image_uint8 = np.clip(reconstructed_image, 0, 255).astype(np.uint8)  # Lossy: values outside [0, 255] are clipped
    cv2.imwrite(save_path, reconstructed_image_uint8)

categories = df['category'].unique()
base_input_dir = "patches_train"
base_output_dir = "all-env_original_images_raw"

patch_size = (32, 32)
original_image_shape = (256, 256)


for category, image_patches in training_patches_by_category.items():
    for image_id, (patches, positions) in image_patches.items():
        centered_patches = center_patches(np.array(patches))
        reconstructed_image = reassemble_image_from_patches(centered_patches, positions, original_image_shape, patch_size)
        output_dir = os.path.join(base_output_dir, category)
        os.makedirs(output_dir, exist_ok=True)
        output_path = os.path.join(output_dir, f"reconstructed_image_{image_id}.png")
        save_reconstructed_image(reconstructed_image, output_path)
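
As a sanity check on the reassembly logic, the sketch below splits a synthetic 256x256 image into non-overlapping 32x32 patches and puts it back together. `split_into_patches` is a hypothetical stand-in for the patch extraction done earlier in the notebook, and `reassemble` is a simplified copy of `reassemble_image_from_patches` without the bounds check.

```python
import numpy as np

def split_into_patches(image, patch_size):
    """Cut an image into non-overlapping flattened patches with their (row, col) offsets."""
    patch_height, patch_width = patch_size
    patches, positions = [], []
    for i in range(0, image.shape[0] - patch_height + 1, patch_height):
        for j in range(0, image.shape[1] - patch_width + 1, patch_width):
            patches.append(image[i:i + patch_height, j:j + patch_width].flatten())
            positions.append((i, j))
    return np.array(patches), positions

def reassemble(patches, positions, image_shape, patch_size):
    """Place each flattened patch back at its recorded offset."""
    patch_height, patch_width = patch_size
    image = np.zeros(image_shape, dtype=np.float32)
    for patch, (i, j) in zip(patches, positions):
        image[i:i + patch_height, j:j + patch_width] = patch.reshape(patch_height, patch_width)
    return image

# Synthetic image whose pixel values are all distinct
image = np.arange(256 * 256, dtype=np.float32).reshape(256, 256)
patches, positions = split_into_patches(image, (32, 32))
restored = reassemble(patches, positions, image.shape, (32, 32))
print(patches.shape, np.array_equal(image, restored))
```

With a 256x256 image and 32x32 patches this yields 64 patches of 1024 values each, which is where the 1024 features per patch come from.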


In [None]:
all_components_pca_by_category = apply_all_components_pca(centered_training_patches_by_category)
num_components_all_dict = {category: 1024 for category in centered_training_patches_by_category}

In [None]:
reduced_components_pca_by_category_95, num_components_reduced_dict_95, _ = apply_reduced_pca(centered_training_patches_by_category, number_variance=0.95)
reduced_components_pca_by_category_90, num_components_reduced_dict_90, _ = apply_reduced_pca(centered_training_patches_by_category, number_variance=0.90)
reduced_components_pca_by_category_85, num_components_reduced_dict_85, _ = apply_reduced_pca(centered_training_patches_by_category, number_variance=0.85)
reduced_components_pca_by_category_80, num_components_reduced_dict_80, _ = apply_reduced_pca(centered_training_patches_by_category, number_variance=0.80)
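
`apply_reduced_pca` is defined earlier in the notebook and is not reproduced here. As an illustration of what keeping a target fraction of variance means, the following NumPy-only sketch fits a PCA with an SVD and keeps the fewest components whose cumulative explained variance reaches the target. The function name `fit_pca_for_variance` and the synthetic data are assumptions, not the notebook's implementation.

```python
import numpy as np

def fit_pca_for_variance(data, variance=0.95):
    """Return (mean, components, ratios) keeping the fewest components that reach `variance`."""
    mean = data.mean(axis=0)
    centered = data - mean
    # Rows of vt are the principal directions, ordered by decreasing singular value
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    ratio = singular_values ** 2 / np.sum(singular_values ** 2)
    k = int(np.searchsorted(np.cumsum(ratio), variance)) + 1
    k = min(k, len(ratio))
    return mean, vt[:k], ratio[:k]

rng = np.random.default_rng(0)
# Synthetic data: 500 samples in 20 dimensions, variance concentrated in 3 directions
latent = rng.normal(size=(500, 3)) * np.array([10.0, 5.0, 2.0])
mixing = np.linalg.qr(rng.normal(size=(20, 3)))[0].T  # (3, 20), orthonormal rows
data = latent @ mixing + 0.01 * rng.normal(size=(500, 20))

mean, components, ratio = fit_pca_for_variance(data, 0.95)
print(components.shape, ratio.sum())
```

Lowering the target from 0.95 to 0.80 keeps fewer components, which leaves larger residuals and therefore higher OOD scores for the same patches.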

In [None]:
def print_num_components(num_components_reduced_dict, variance_level):
    print(f"\nNumber of Components for {variance_level*100}% Variance Explained:")
    for category, num_components in num_components_reduced_dict.items():
        print(f"Category: {category}, Number of Components: {num_components}")

# Print the number of components for each explained-variance level
print_num_components(num_components_reduced_dict_95, 0.95)
print_num_components(num_components_reduced_dict_90, 0.90)
print_num_components(num_components_reduced_dict_85, 0.85)
print_num_components(num_components_reduced_dict_80, 0.80)

In [None]:
def project_and_reconstruct_patches(pca_by_category, centered_patches_by_category):
    reconstructed_patches_by_category = {}

    for category, pca in pca_by_category.items():
        images = centered_patches_by_category[category]
        reconstructed_images = {}
        
        for image_id, (centered_patches, positions) in images.items():
            projected = pca.transform(centered_patches)
            reconstructed_patches = pca.inverse_transform(projected)
            reconstructed_images[image_id] = (reconstructed_patches, positions)
        
        reconstructed_patches_by_category[category] = reconstructed_images
    
    return reconstructed_patches_by_category

In [None]:
all_components_pca_by_category = apply_all_components_pca(centered_training_patches_by_category)

reconstructed_patches_by_category = project_and_reconstruct_patches(all_components_pca_by_category, centered_training_patches_by_category)


In [None]:
base_output_dir = "all-env_all_components_pca_images_reconstructed"
patch_size = (32, 32)
original_image_shape = (256, 256)

for category, image_patches in reconstructed_patches_by_category.items():
    for image_id, (patches, positions) in image_patches.items():
        reconstructed_image = reassemble_image_from_patches(patches, positions, original_image_shape, patch_size)
        
        output_dir = os.path.join(base_output_dir, category)
        os.makedirs(output_dir, exist_ok=True)
        output_path = os.path.join(output_dir, f"reconstructed_image_{image_id}.png")
        save_reconstructed_image(reconstructed_image, output_path)


In [None]:
reduced_reconstructed_patches_by_category = project_and_reconstruct_patches(reduced_components_pca_by_category_90, centered_training_patches_by_category)

In [None]:
base_output_dir = "all-env_reduced_components_pca_images_reconstructed"
patch_size = (32, 32)
original_image_shape = (256, 256)

for category, image_patches in reduced_reconstructed_patches_by_category.items():
    for image_id, (patches, positions) in image_patches.items():
        reconstructed_image = reassemble_image_from_patches(patches, positions, original_image_shape, patch_size)
        
        output_dir = os.path.join(base_output_dir, category)
        os.makedirs(output_dir, exist_ok=True)
        output_path = os.path.join(output_dir, f"reconstructed_image_{image_id}.png")
        save_reconstructed_image(reconstructed_image, output_path)


# Test

In [None]:
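# Pass each category once; iterating over `y` itself would reload a category for every image row
test_patches_by_category = load_patches_by_category('patches_test', y.unique())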

In [None]:
centered_test_patches_by_category = {}

for category, images in test_patches_by_category.items():
    centered_images = {}
    for image_id, (patches, positions) in images.items():
        centered_patches = center_patches(np.array(patches))
        centered_images[image_id] = (centered_patches, positions)
    
    centered_test_patches_by_category[category] = centered_images


In [None]:
def save_residual_images_as_full_image(residuals_by_category, output_dir, patch_size=(32, 32), original_image_shape=(256, 256)):
    for category, images in residuals_by_category.items():
        category_dir = os.path.join(output_dir, category)
        os.makedirs(category_dir, exist_ok=True)
        
        for image_id, (residuals, positions) in images.items():
            residual_image = reassemble_image_from_patches(residuals, positions, original_image_shape, patch_size)
            
            residual_image = np.clip(residual_image, 0, 255)
            residual_image = residual_image.astype(np.uint8)
            
            residual_image_filename = f"residual_image_{image_id}.png"
            residual_image_path = os.path.join(category_dir, residual_image_filename)
            cv2.imwrite(residual_image_path, residual_image)

residuals_output_dir = "all-env_reduced_components_pca_images_residuals"

def calculate_residuals_with_pca(pca_by_category, patches_by_category):
    residuals_by_category = {}
    for category, images in patches_by_category.items():
        residuals_images = {}
        for image_id, (patches, positions) in images.items():
            pca = pca_by_category[category]
            projected_data = pca.transform(patches)
            reconstructed_data = pca.inverse_transform(projected_data)
            residuals = patches - reconstructed_data
            residuals_images[image_id] = (residuals, positions)
        residuals_by_category[category] = residuals_images
    return residuals_by_category

all_components_residuals_by_category = calculate_residuals_with_pca(all_components_pca_by_category, centered_test_patches_by_category)

reduced_components_residuals_by_category_95 = calculate_residuals_with_pca(reduced_components_pca_by_category_95, centered_test_patches_by_category)
reduced_components_residuals_by_category_90 = calculate_residuals_with_pca(reduced_components_pca_by_category_90, centered_test_patches_by_category)
reduced_components_residuals_by_category_85 = calculate_residuals_with_pca(reduced_components_pca_by_category_85, centered_test_patches_by_category)
reduced_components_residuals_by_category_80 = calculate_residuals_with_pca(reduced_components_pca_by_category_80, centered_test_patches_by_category)

save_residual_images_as_full_image(reduced_components_residuals_by_category_90, residuals_output_dir)


In [None]:
import itertools
import numpy as np

# List of categories (environments)
categories = ['Bedroom', 'Suburb', 'Industry', 'Kitchen', 'LivingRoom', 'Coast', 'Forest', 
              'Highway', 'InsideCity', 'Mountain', 'OpenCountry', 'Street', 'Building', 
              'Office', 'Store']

# Compute the OOD score of one image
def calculate_ood_scores(residuals, original_patches):
    if residuals.shape != original_patches.shape:
        print(f"Shape mismatch in calculate_ood_scores: residuals={residuals.shape}, original_patches={original_patches.shape}")
        return float('nan')
    
    # OOD score of an image: mean ratio between the residual norm and the original patch norm
    residual_norms = np.linalg.norm(residuals, axis=1)
    original_norms = np.linalg.norm(original_patches, axis=1)
    ood_scores = residual_norms / original_norms
    return np.mean(ood_scores)

# Process the data and compute OOD scores without concatenating all patches
def process_and_calculate_ood(residuals_by_image, original_patches_by_image):
    total_ood_scores = []
    
    # Iterate over all images and compute the OOD score of each one
    for (residuals, _), (original_patches, _) in zip(residuals_by_image, original_patches_by_image):
        if residuals.shape[0] != original_patches.shape[0]:
            print(f"Mismatch in number of patches: residuals={residuals.shape[0]}, original_patches={original_patches.shape[0]}")
            continue
        
        # Compute the OOD score of this image
        ood_score = calculate_ood_scores(residuals, original_patches)
        if not np.isnan(ood_score):
            total_ood_scores.append(ood_score)
    
    if not total_ood_scores:
        print("No valid data for calculation.")
        return float('nan')
    
    # Return the mean of the per-image OOD scores
    return np.mean(total_ood_scores)

# Iterate over all pairs of categories (environments)
ood_scores_dict = {}

for category_train, category_test in itertools.product(categories, repeat=2):
    try:
        # Retrieve the test-set residuals of category_train and the original test patches of category_test;
        # process_and_calculate_ood pairs them by position with zip
        train_residuals = list(all_components_residuals_by_category[category_train].values())
        test_original_patches = list(centered_test_patches_by_category[category_test].values())
        
        # Compute the OOD score from the residuals and the original test patches
        ood_score = process_and_calculate_ood(train_residuals, test_original_patches)
        
        # Store the result
        ood_scores_dict[f"{category_train}_train_on_{category_test}_test"] = ood_score
        print(f"Mean OOD Score using {category_train} Test Data on {category_test} Test Data: {ood_score}")
    except KeyError as e:
        print(f"Data not available for {category_train} testing on {category_test} testing. Skipping.")
        continue

# Print the OOD scores for all combinations
for key, ood_score in ood_scores_dict.items():
    print(f"Mean OOD Score for {key}: {ood_score}")


In [None]:
import itertools

# Variance levels to evaluate
variance_levels = [0.95, 0.9, 0.85, 0.8]

# List of categories (environments)
categories = ['Bedroom', 'Suburb', 'Industry', 'Kitchen', 'LivingRoom', 'Coast', 'Forest', 
              'Highway', 'InsideCity', 'Mountain', 'OpenCountry', 'Street', 'Building', 
              'Office', 'Store']

# Dictionary that stores the OOD scores for each variance level
ood_scores_all_levels = {}

# Iterate over each variance level and compute the OOD scores for all categories
for variance in variance_levels:
    ood_scores_all_levels[variance] = {}
    
    # Iterate over every combination of train and test categories
    for category_train, category_test in itertools.product(categories, repeat=2):
        try:
            # Retrieve the residuals and original patches for the train and test categories
            train_residuals = list(globals()[f"reduced_components_residuals_by_category_{int(variance*100)}"][category_train].values())
            test_original_patches = list(centered_test_patches_by_category[category_test].values())
            
            # Compute the OOD score
            ood_score = process_and_calculate_ood(train_residuals, test_original_patches)
            
            # Store the result
            ood_scores_all_levels[variance][f"{category_train}_train_{category_test}_test"] = ood_score
            print(f"Variance {variance}: OOD Score using {category_train} Train Data on {category_test} Test Data: {ood_score}")
        
        except KeyError as e:
            print(f"Data not available for {category_train} training on {category_test} testing at variance level {variance}. Skipping.")
            continue

# Print the OOD scores for all combinations and variance levels
for variance, scores in ood_scores_all_levels.items():
    print(f"\n--- Variance level: {variance} ---")
    for test_type, score in scores.items():
        print(f"{test_type}: OOD Score: {score}")


In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Layout of `ood_scores_all_levels`, to be filled with OOD scores below
ood_scores_all_levels = {
    0.95: {},
    0.9: {},
    0.85: {},
    0.8: {}
}

# Initialize the categories
categories = ['Bedroom', 'Suburb', 'Industry', 'Kitchen', 'LivingRoom', 'Coast', 'Forest', 
              'Highway', 'InsideCity', 'Mountain', 'OpenCountry', 'Street', 'Building', 
              'Office', 'Store']

# Fill `ood_scores_all_levels` dynamically for every variance level
for variance in ood_scores_all_levels.keys():
    ood_scores_all_levels[variance] = {
        f"{category_train}_train_{category_test}_test": process_and_calculate_ood(
            list(globals()[f"reduced_components_residuals_by_category_{int(variance*100)}"][category_train].values()),
            list(centered_test_patches_by_category[category_test].values())
        )
        for category_train in categories
        for category_test in categories
    }

# Plot a heatmap of the OOD scores using matplotlib only
def plot_ood_heatmap(variance, ood_scores_all_levels):
    # Fetch all OOD scores for the requested variance level
    ood_scores = ood_scores_all_levels[variance]

    # Build the score matrix (rows: train category, columns: test category)
    score_matrix = np.zeros((len(categories), len(categories)))

    for i, category_train in enumerate(categories):
        for j, category_test in enumerate(categories):
            key = f"{category_train}_train_{category_test}_test"
            score_matrix[i, j] = ood_scores[key]

    # Draw the heatmap
    plt.figure(figsize=(12, 8))
    plt.imshow(score_matrix, cmap="coolwarm", interpolation='nearest')

    # Label the axis ticks with the category names
    plt.xticks(np.arange(len(categories)), categories, rotation=90)
    plt.yticks(np.arange(len(categories)), categories)

    # Annotate each cell with its score
    for i in range(len(categories)):
        for j in range(len(categories)):
            plt.text(j, i, f"{score_matrix[i, j]:.2f}", ha="center", va="center", color="black")

    # Add the colorbar, title, and axis labels
    plt.colorbar(label='OOD Score')
    plt.title(f"OOD Scores Heatmap for Variance Level {variance}")
    plt.xlabel("Test Category")
    plt.ylabel("Train Category")
    plt.tight_layout()
    plt.show()

# Generate one heatmap per variance level
for variance in ood_scores_all_levels.keys():
    plot_ood_heatmap(variance, ood_scores_all_levels)


# Agnostic

In [None]:
# Dictionary that will hold the PCA results
pca_results = {}

# Explained-variance percentages to collect
percentages = [95, 90, 85, 80]

# Assumes the variables 'reduced_components_pca_by_category_X' (X being the percentage)
# exist and hold the fitted PCA objects for each category
for perc in percentages:
    pca_data = globals().get(f'reduced_components_pca_by_category_{perc}', None)

    if pca_data is not None:
        # Initialize the dictionary for this percentage
        pca_results[perc] = {}

        for category in categories:
            if category in pca_data:
                pca = pca_data[category]
                components = pca.components_
                explained_variance_ratio = pca.explained_variance_ratio_

                # Store the components, the explained variance ratio, and the PCA object itself
                pca_results[perc][category] = {
                    'components': components,
                    'explained_variance_ratio': explained_variance_ratio,
                    'pca_object': pca  # Keep the fitted PCA object as well
                }
            else:
                print(f"Category '{category}' is not present in the data for {perc}%.")
    else:
        print(f"PCA data not found for {perc}%.")


In [None]:
def plot_cumulative_variance(explained_variance_ratio):
    """
    Plot the cumulative explained variance from the per-component explained variance ratios.

    Parameters:
    - explained_variance_ratio: array or list with the explained variance ratio of each component.
    """
    cumulative_variance = np.cumsum(explained_variance_ratio)  # Running total of explained variance

    plt.figure(figsize=(8, 5))
    plt.plot(np.arange(1, len(cumulative_variance) + 1), cumulative_variance, marker='o', linestyle='--', color='b')
    plt.title('Cumulative Explained Variance')
    plt.xlabel('Number of Components')
    plt.ylabel('Cumulative Explained Variance')
    plt.grid(True)
    plt.show()
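
The curve plotted above is usually read to find how many components are needed to reach a target such as 95%. As a minimal sketch (the helper name `components_for_variance` and the toy ratios are assumptions, not part of the notebook), the same number can be read off the cumulative sum directly. Note that this returns the first component count that reaches the threshold, whereas a `cumulative <= threshold` mask keeps only the components that stay at or below it and can stop one component short.

```python
import numpy as np


def components_for_variance(explained_variance_ratio, threshold=0.95):
    """Smallest k such that the first k components explain at least `threshold`."""
    cumulative = np.cumsum(explained_variance_ratio)
    # Index of the first cumulative value that is >= threshold, turned into a count
    k = int(np.searchsorted(cumulative, threshold, side="left")) + 1
    # If the threshold is never reached, fall back to all components
    return min(k, len(cumulative))


# Toy ratios (hypothetical); cumulative sum is 0.5, 0.75, 0.875, 0.9375, 1.0
ratios = np.array([0.5, 0.25, 0.125, 0.0625, 0.0625])
k95 = components_for_variance(ratios, 0.95)
k80 = components_for_variance(ratios, 0.80)
print(k95, k80)
```

With a fitted PCA object, the input would be its `explained_variance_ratio_` attribute.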

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Project patches onto the principal components of a category
def project_test_patches(patches, pca_components):
    return np.dot(patches, pca_components.T)

# Compute, per component, the norm and mean of the inner product between two sets of projections
def calculate_norms_and_means(projections_A, projections_B):
    norms = []
    means = []
    means_norms = []

    for i in range(projections_A.shape[1]):  # Iterate over the components
        # The dot product of two 1-D vectors is a scalar, so its norm is its absolute value
        # and its mean is the value itself
        dot_products = np.dot(projections_A[:, i], projections_B[:, i].T)
        norms.append(np.linalg.norm(dot_products))
        means.append(np.mean(dot_products))
        means_norms.append(np.mean(np.linalg.norm(dot_products)))

    return norms, means, means_norms

# Select the components within ~90% cumulative explained variance, drop those with very large norms,
# and order the rest from smallest to largest norm
def capture_components_by_variance_and_norm(explained_variance_ratio, norms, variance_threshold=0.9, norm_threshold=1e7):
    # Cumulative explained variance
    cumulative_variance = np.cumsum(explained_variance_ratio)

    # Indices of the components whose cumulative variance stays within the threshold
    selected_indices = np.where(cumulative_variance <= variance_threshold)[0]

    # Drop components with very large norms
    selected_indices = [i for i in selected_indices if norms[i] < norm_threshold]

    # Sort the indices by norm
    selected_indices = sorted(selected_indices, key=lambda idx: norms[idx])

    return selected_indices

# Categories to iterate over
categories = ['Bedroom', 'Suburb', 'Industry', 'Kitchen', 'LivingRoom', 'Coast', 'Forest', 
              'Highway', 'InsideCity', 'Mountain', 'OpenCountry', 'Street', 'Building', 
              'Office', 'Store']

# Work only with the PCA fitted at 95% explained variance
perc = 95

selected_indices_dict = {}

# Iterate over the categories to cover both intra-category and cross-category pairs
for category in categories:
    for other_category in categories:
        # Load the components and explained variance ratio of the category being projected onto
        components = pca_results[perc][other_category]['components']
        explained_variance_ratio = pca_results[perc][other_category]['explained_variance_ratio']
        
        if category == other_category:
            print(f"\nCategory: {category} on {category} (Intra-Category), Percentage: {perc}%")
        else:
            print(f"\nCategory: {category} on {other_category} (Cross-Category), Percentage: {perc}%")
        
        # Project the patches (intra- or cross-category)
        all_norms = []
        all_selected_indices = []

        for image_id, (patches, positions) in centered_test_patches_by_category[category].items():
            # Project the patches onto the components
            projected_patches = project_test_patches(patches, components)

            # Compute the norms, means, and means of norms for this image
            norms_category, means_category, means_norms_category = calculate_norms_and_means(projected_patches, projected_patches)

            # Select the components within ~90% of the variance with the smallest norms, excluding very large norms
            selected_indices = capture_components_by_variance_and_norm(explained_variance_ratio, norms_category)

            # Keep the norms and the selected components
            all_norms.append(norms_category)
            all_selected_indices.append(selected_indices)

        # Check that at least one component was selected
        if len(all_selected_indices) == 0 or np.concatenate(all_selected_indices).size == 0:
            print(f"Warning: No components selected for {category} on {other_category}. Skipping this combination.")
            continue

        # Aggregate the selected components across all images; cast to int because
        # concatenating with an empty list yields a float array
        aggregated_selected_indices = np.unique(np.concatenate(all_selected_indices)).astype(int)

        # Create the inner dictionary if the key does not exist yet
        if category not in selected_indices_dict:
            selected_indices_dict[category] = {}

        selected_indices_dict[category][other_category] = aggregated_selected_indices

        # Skip plotting if no components were selected
        if len(aggregated_selected_indices) == 0:
            print(f"Warning: No valid components selected for {category} on {other_category}. Skipping plot.")
            continue

        # Plot the mean norm of each selected component
        plt.figure(figsize=(10, 6))
        plt.bar(aggregated_selected_indices, [np.mean([norms[i] for norms in all_norms if i < len(norms)]) for i in aggregated_selected_indices], 
                color='green' if category != other_category else 'blue',
                label=f'{category} on {other_category} - Selected Components')
        plt.title(f'Selected Components Based on ~90% Variance and Smallest Norms ({category} on {other_category}) - 95% Variance Explained')
        plt.xlabel('Component Index')
        plt.ylabel('Mean of Norms')
        plt.legend()
        plt.show()


In [None]:
import os
import numpy as np

# Parameters
patch_size = (32, 32)

def project_and_transform_back(data, pca, specific_indices):
    """
    Project the data onto the given principal components and reconstruct it from those components only.
    """
    # Project the patches onto the principal components
    # (pca.transform subtracts pca.mean_ before projecting, unlike project_test_patches)
    projected = pca.transform(data)

    # Keep only the requested components
    projected_specific = projected[:, specific_indices]

    # Reconstruct the patches from those components only (the mean is not added back)
    specific_components = pca.components_[specific_indices]
    reconstructed_patches = np.dot(projected_specific, specific_components)
    
    return reconstructed_patches

def calculate_mean_ood_for_specific_components(original_patches_by_image, pca, specific_indices):
    """
    Project the original patches onto specific PCA components, reconstruct them,
    and return the mean of the resulting OOD scores.
    """
    total_ood_scores = []

    # Iterate over all images
    for image_id, (patches, _) in original_patches_by_image.items():
        # Project onto the specific components and reconstruct
        reconstructed_patches = project_and_transform_back(patches, pca, specific_indices)

        # Residuals (reconstruction error)
        residuals = patches - reconstructed_patches

        # OOD score: norm of the residual divided by the norm of the original patch
        original_norms = np.linalg.norm(patches, axis=1)
        residual_norms = np.linalg.norm(residuals, axis=1)

        # One OOD score per patch
        ood_scores = residual_norms / original_norms

        # Append this image's scores to the overall list
        total_ood_scores.extend(ood_scores)

    # Return the mean OOD score over all patches
    return np.mean(total_ood_scores)

# Compute the mean OOD score for every pair of categories
mean_ood_scores = {}

for category in categories:
    for other_category in categories:
        # Check that component indices were selected for this pair of categories
        specific_indices = selected_indices_dict.get(category, {}).get(other_category)
        if specific_indices is None or len(specific_indices) == 0:
            print(f"Warning: No selected components for {category} on {other_category}. Skipping.")
            continue

        # Fetch the PCA object fitted on other_category
        pca_object = pca_results[perc][other_category]['pca_object']  # Components come from other_category

        # Check that test patches exist for the category
        if category not in centered_test_patches_by_category:
            print(f"Warning: No test patches found for {category}. Skipping.")
            continue

        # Compute the mean OOD score from the projection onto the selected components
        mean_ood = calculate_mean_ood_for_specific_components(centered_test_patches_by_category[category], pca_object, specific_indices)

        # Store the mean in the dictionary
        mean_ood_scores[f"{category}_on_{other_category}"] = mean_ood

# Print every computed mean
for key, mean_ood in mean_ood_scores.items():
    print(f"Mean OOD Score for {key}: {mean_ood}")
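
The score computed above is the norm of the reconstruction residual divided by the norm of the original patch. A minimal NumPy-only sketch of that ratio on synthetic data is shown here; the function name `relative_residual_scores`, the toy 8-dimensional data, and the small floor on the denominator are assumptions added for illustration, and the sketch uses a plain orthonormal basis instead of a fitted PCA object.

```python
import numpy as np


def relative_residual_scores(patches, components):
    """Residual norm divided by original norm, one score per row of `patches`."""
    projected = patches @ components.T        # coordinates in the subspace
    reconstructed = projected @ components    # back to the original space
    residual_norms = np.linalg.norm(patches - reconstructed, axis=1)
    original_norms = np.linalg.norm(patches, axis=1)
    # Floor on the denominator guards against all-zero rows (added safeguard)
    return residual_norms / np.maximum(original_norms, 1e-12)


rng = np.random.default_rng(0)
# Orthonormal basis of a 2-D subspace inside an 8-D space, shape (2, 8)
basis = np.linalg.qr(rng.normal(size=(8, 2)))[0].T
in_dist = rng.normal(size=(50, 2)) @ basis    # lies entirely in the subspace
out_dist = rng.normal(size=(50, 8))           # spreads over all 8 dimensions

in_score = relative_residual_scores(in_dist, basis).mean()
out_score = relative_residual_scores(out_dist, basis).mean()
print(f"in-distribution: {in_score:.3g}, out-of-distribution: {out_score:.3g}")
```

Data that the components can reconstruct scores near 0, while data with energy outside the subspace scores closer to 1, which is the behavior the heatmaps are meant to expose.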


In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Available categories
categories = ['Bedroom', 'Suburb', 'Industry', 'Kitchen', 'LivingRoom', 'Coast', 'Forest', 
              'Highway', 'InsideCity', 'Mountain', 'OpenCountry', 'Street', 'Building', 
              'Office', 'Store']

# NaN-filled matrix that will hold the OOD scores
ood_score_matrix = np.full((len(categories), len(categories)), np.nan)

# Fill the matrix with the computed OOD scores (rows: test category, columns: train category)
for i, test_category in enumerate(categories):
    for j, train_category in enumerate(categories):
        key = f"{test_category}_on_{train_category}"
        if key in mean_ood_scores:
            ood_score_matrix[i, j] = mean_ood_scores[key]

# Wrap the matrix in a DataFrame to make plotting easier
ood_score_df = pd.DataFrame(ood_score_matrix, index=categories, columns=categories)

# Plot the heatmap using matplotlib only
fig, ax = plt.subplots(figsize=(12, 8))

# Draw the heatmap with imshow
cax = ax.imshow(ood_score_df, cmap="coolwarm", aspect="auto")

# Annotate each cell with its value
for i in range(len(categories)):
    for j in range(len(categories)):
        value = ood_score_matrix[i, j]
        if not np.isnan(value):
            ax.text(j, i, f'{value:.4f}', ha='center', va='center', color='black')

# Configure the axes
ax.set_xticks(np.arange(len(categories)))
ax.set_yticks(np.arange(len(categories)))
ax.set_xticklabels(categories, rotation=45, ha="right")
ax.set_yticklabels(categories)

# Add the title and axis labels
ax.set_title('Heatmap of OOD Scores for Test and Train Categories')
ax.set_xlabel('Train Category')
ax.set_ylabel('Test Category')

# Add the colorbar
fig.colorbar(cax, ax=ax, label='OOD Score')

# Adjust the layout
plt.tight_layout()

# Show the heatmap
plt.show()
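
One way to summarize a heatmap like this is to compare the diagonal (test and train category match) with the off-diagonal cells. The sketch below is an assumption about how that could be done, not part of the notebook: the helper name `diagonal_vs_offdiagonal` and the toy matrix are hypothetical, and it assumes a square matrix in which missing pairs are stored as NaN.

```python
import numpy as np


def diagonal_vs_offdiagonal(score_matrix):
    """Mean of the diagonal and mean of the off-diagonal cells, ignoring NaNs."""
    scores = np.asarray(score_matrix, dtype=float)
    diagonal_mask = np.eye(scores.shape[0], dtype=bool)
    return np.nanmean(scores[diagonal_mask]), np.nanmean(scores[~diagonal_mask])


# Toy 3x3 score matrix (hypothetical values)
toy_scores = np.array([
    [0.1, 0.5, 0.7],
    [0.5, 0.3, 0.9],
    [0.7, 0.9, 0.2],
])
intra_mean, cross_mean = diagonal_vs_offdiagonal(toy_scores)
print(f"intra-category mean: {intra_mean:.2f}, cross-category mean: {cross_mean:.2f}")
```

In the notebook, the input would be `ood_score_matrix`; a clearly lower diagonal mean would suggest the score separates matching from non-matching categories.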
