# 🎉 Out-of-Distribution (OOD) with PCA using Deep Features from the Latent Space

The goal of this notebook is to explore in depth how Principal Component Analysis (PCA) can be used for out-of-distribution (OOD) tasks on deep features taken from the latent space.

## 📝 Plan of action

### ♻️ Preprocessing phase

In order to achieve our goal, we need to understand how the dataset is structured.

For this notebook, we are going to use the CBIR 15 dataset, which contains images of different places, such as an office, a bedroom, a mountain, etc. Note that some places are similar to one another, e.g. a bedroom and a living room.

Thus, to extract features from the images, we first have to preprocess them:

- Collect the images located in data/CBIR_15-scene into a dataframe using Pandas
  - Locate the "Labels.txt" file: it shows the index at which each category's images start
- Create the dataset from this information with two columns: the path to the image and its category
- Resize all of the images to the same size (in this case, 224x224, matching the `standard_size` used in the code below)
  
Next, to extract the features, we divide the resized images into patches of 32x32 pixels. Working on small patches keeps each processing step fast and avoids long waiting times.
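As a rough illustration of the patching step, here is a minimal sketch using only NumPy. The helper name `split_into_patches` and the toy 64x64 image are assumptions made for this example; the notebook's own custom modules may do this differently.

```python
import numpy as np

def split_into_patches(image, patch_size=32):
    """Split a 2-D grayscale image into non-overlapping square patches.

    Hypothetical helper for illustration, not part of the notebook's modules.
    """
    height, width = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    rows, cols = height // patch_size, width // patch_size
    # (rows, patch, cols, patch) -> (rows, cols, patch, patch) -> (n_patches, patch, patch)
    patches = image.reshape(rows, patch_size, cols, patch_size).swapaxes(1, 2)
    return patches.reshape(rows * cols, patch_size, patch_size)

# Toy 64x64 image, so we expect a 2x2 grid of 32x32 patches
toy_image = np.arange(64 * 64, dtype=np.float32).reshape(64, 64)
toy_patches = split_into_patches(toy_image)
print(toy_patches.shape)  # (4, 32, 32)
```

Patches are returned in row-major order, so `toy_patches[1]` is the top-right block of the toy image.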

After all the preprocessing, we separate the images into two different folders: one contains the patches of the training images, which give us the principal components and dimensions, and the other contains the patches of the test images, which are projected onto those dimensions to obtain an OOD score.

### 🏋🏽‍♂️ Training phase

With the images stored inside the "patches_train" folder, the first thing we do is _normalize_ all of them, so that every variable is on the same scale and PCA can find the true directions of maximum variance.

Next, we apply PCA with all the components. As we have patches of 32x32, we have 1024 features, and hence up to 1024 components. Then we plot the cumulative explained variance to see how many components truly contribute most of the variance of the data. We use a threshold of 95% of the variance in this notebook.
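To make the 95% threshold concrete, the sketch below counts how many components are needed to reach a given cumulative explained variance. It uses a plain NumPy SVD on synthetic data rather than the notebook's features; the function name `components_for_variance` and the toy data are assumptions for illustration only.

```python
import numpy as np

def components_for_variance(data, threshold=0.95):
    """Return how many principal components reach `threshold` cumulative variance."""
    centered = data - data.mean(axis=0)
    # Squared singular values of the centered data are proportional to the
    # variance captured by each principal component
    singular_values = np.linalg.svd(centered, compute_uv=False)
    explained_ratio = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(explained_ratio)
    # First position where the cumulative variance reaches the threshold (1-based count)
    n_components = int(np.searchsorted(cumulative, threshold) + 1)
    return n_components, cumulative

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 10 features, variance concentrated in 2 directions
scales = np.array([10.0, 5.0] + [0.1] * 8)
toy_data = rng.normal(size=(200, 10)) * scales
n_components, cumulative = components_for_variance(toy_data)
print(n_components, cumulative[n_components - 1])
```

Plotting `cumulative` against the component index gives the same kind of curve the notebook uses to judge how many components matter.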

After fitting the PCA with the components that describe 95% of the variance, it's time to test our images and see how much of their data falls in the residual space.

### ⚗️ Test phase and results

In this phase, we take the test images and normalize them with the same scale used for each PCA. This is important to keep the final results consistent and to measure the norms in the new dimensions properly.

After that, we compute the norm of the projection of each sample onto the space orthogonal to the principal components (the residual space) and divide it by the norm of the sample itself, measured from the origin. This ratio is the OOD score.

We calculate the mean score for each category and take the minimum. The predicted environment is the category with the smallest mean score.
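The scoring rule above can be sketched with NumPy alone. Everything here is an assumption for illustration: the helper names (`fit_subspace`, `ood_scores`, `make_samples`), the synthetic 6-dimensional features, and the category names "A" and "B". It is not the notebook's implementation, only one way to read the description.

```python
import numpy as np

def fit_subspace(train_features, variance_threshold=0.95):
    """Return the training mean and an orthonormal basis of the principal subspace."""
    mean = train_features.mean(axis=0)
    _, singular_values, vt = np.linalg.svd(train_features - mean, full_matrices=False)
    cumulative = np.cumsum(singular_values**2) / np.sum(singular_values**2)
    k = int(np.searchsorted(cumulative, variance_threshold) + 1)
    return mean, vt[:k]  # rows of vt[:k] are the kept principal directions

def ood_scores(features, mean, basis):
    """Residual norm divided by sample norm, for each row of `features`."""
    centered = features - mean
    reconstruction = centered @ basis.T @ basis  # projection onto the principal subspace
    residual_norm = np.linalg.norm(centered - reconstruction, axis=1)
    sample_norm = np.linalg.norm(centered, axis=1)
    return residual_norm / np.maximum(sample_norm, 1e-12)

def make_samples(rng, n, axes, dim=6, noise=0.01):
    """Synthetic features: unit-variance signal on `axes`, small noise everywhere."""
    samples = noise * rng.normal(size=(n, dim))
    samples[:, axes] += rng.normal(size=(n, len(axes)))
    return samples

rng = np.random.default_rng(0)
# Two synthetic categories that live in different 2-D planes of a 6-D space
train = {"A": make_samples(rng, 200, [0, 1]), "B": make_samples(rng, 200, [2, 3])}
subspaces = {name: fit_subspace(x) for name, x in train.items()}

# Test samples drawn from category A, scored against every category's subspace
test_a = make_samples(rng, 50, [0, 1])
mean_scores = {name: float(ood_scores(test_a, mean, basis).mean())
               for name, (mean, basis) in subspaces.items()}
predicted = min(mean_scores, key=mean_scores.get)
print(mean_scores, predicted)
```

A score near 0 means the sample lies almost entirely inside the category's principal subspace, while a score near 1 means almost all of it falls in the residual space.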


--------------------------

First of all, we need to understand which libraries we are going to use:

- os: Deals with the operating system interface, such as finding the relative and absolute paths of files inside a project and reading/writing files.
- sys: This module provides access to some variables used or maintained by the interpreter and to functions that interact strongly with the interpreter.
- numpy: NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.
- pandas: Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
- matplotlib: Deals with plotting graphs to visualize data in a graphical way.
- sklearn: Scikit-learn provides dozens of built-in machine learning algorithms and models, called estimators.

In [None]:
import os
import sys
import cv2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA


I suggest using a conda virtual environment to avoid messing up your base kernel environment and causing dependency errors in the future.

Once you have successfully installed all the modules, it's time to import our custom modules, which deal with:

- Creation of our dataframe using pandas
- Separation of our dataset into patches of 32x32 in folders of training and test

In [None]:

sys.path.append(os.path.abspath('..'))

from dataframe_generator import *
from images_standardizing import *

In [None]:
import tarfile

def extract_tgz(tgz_path, extract_to):
    if not os.path.exists(extract_to):
        os.makedirs(extract_to)
    
    with tarfile.open(tgz_path, 'r:gz') as tar:
        tar.extractall(path=extract_to)
        print(f"Files extracted to {extract_to}")

tgz_path = '../CBIR_15-Scene.tgz'
extract_to = '../data/'

extract_tgz(tgz_path, extract_to)

In [None]:
df = create_dataframe()
df

## ☝️ Part I: Comparing two different environments

### ♻️ Preprocessing phase

Now we start our experiments to see whether our idea works. In this first part, we look at what happens with our approach when the two environments are different.

For this case, I arbitrarily pick the **Coast** and **Office** environments.


In [None]:
train_categories = ['Coast', 'Office']

df_different = df[df['category'].isin(train_categories)]
df_different

It's time to split our dataset into training and test sets. We use scikit-learn's built-in `train_test_split` function to do this:

In [None]:
X = df_different['image_path'].tolist()
y = df_different['category'].tolist()
unique_categories = list(df_different['category'].unique())
print(f"Unique categories: {unique_categories}")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

standard_size = (224, 224)

To make sure that everything went well, we plot the grid of all the patches from the first image of our training set.

This is exactly what the module inside our "image_patching.py" does. So now we need to save everything into the subfolders by calling that function:

In [None]:
create_images_set(X_train, X_test, y_train, y_test, output_dir_train='images_train', output_dir_test='images_test', standard_size=standard_size)

Now, we should load our patches for training:

In [None]:
#training_images_by_category = load_images_by_category('images_train', y, image_size=(224, 224))
training_images_by_category = load_images_by_category('images_train', unique_categories, image_size=(224, 224))


In [None]:
def center_images(images):
    # Compute the per-image mean over the pixel axes
    # Check if images have 3 or 4 dimensions
    if len(images.shape) == 3:
        num_images, height, width = images.shape
        # For grayscale images, no need for the 'channels' dimension
        mean_image = np.mean(images, axis=(1, 2), keepdims=True)
    elif len(images.shape) == 4:
        num_images, height, width, channels = images.shape
        mean_image = np.mean(images, axis=(1, 2, 3), keepdims=True)
    else:
        raise ValueError("Unexpected image shape")

    # Subtract the mean from each image
    centered_images = images - mean_image
    
    return centered_images

centered_images_by_category = {}
for category, images in training_images_by_category.items():
    print(images.shape)
    centered_images = center_images(images)
    centered_images_by_category[category] = centered_images
    print(f"Category {category}, images shape: {centered_images.shape}")


### 🏋🏽‍♂️ Training phase

Now that we have our training patches stored in the variable above, we can start our analysis with PCA.

First of all, we **need to normalize and center** the data. It's so important that I had to emphasize it. Plus, since we are dealing with different categories, each one of them should be normalized with a different scaler (which we save for later).
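A minimal sketch of the "one scaler per category, saved for later" idea follows. The `FeatureScaler` class is a hypothetical NumPy stand-in that mimics the behavior of scikit-learn's `StandardScaler` (zero mean, unit variance per feature), and the toy data is synthetic; treat it as an illustration of the fit-on-train, reuse-on-test pattern rather than the notebook's actual pipeline.

```python
import numpy as np

class FeatureScaler:
    """Hypothetical stand-in for a standard scaler: zero mean, unit variance per feature."""

    def fit(self, features):
        self.mean_ = features.mean(axis=0)
        std = features.std(axis=0)
        # Avoid division by zero for constant features
        self.scale_ = np.where(std > 0, std, 1.0)
        return self

    def transform(self, features):
        return (features - self.mean_) / self.scale_

rng = np.random.default_rng(0)
# Synthetic training features for two categories with different statistics
toy_train = {
    "Coast": rng.normal(5.0, 2.0, size=(100, 4)),
    "Office": rng.normal(-1.0, 0.5, size=(100, 4)),
}
toy_test = rng.normal(5.0, 2.0, size=(10, 4))

# Fit one scaler per category on the training data and keep it for later
scalers = {category: FeatureScaler().fit(features) for category, features in toy_train.items()}
scaled_train = {category: scalers[category].transform(features)
                for category, features in toy_train.items()}

# Test data is transformed with each category's training statistics, never refit
scaled_test = {category: scaler.transform(toy_test) for category, scaler in scalers.items()}
print({category: np.round(x.mean(axis=0), 2) for category, x in scaled_train.items()})
```

Reusing the training statistics at test time is what keeps the test features in the same coordinate system as the fitted PCA of each category.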

In [None]:
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model

In [None]:
preprocessed_images_by_category = centered_images_by_category 

base_model = VGG16(weights='imagenet', include_top=True)
base_model.summary()


In [None]:
# Use the second-to-last layer (fc2, fully connected) as the feature extractor
model = Model(inputs=base_model.input, outputs=base_model.get_layer('fc2').output)

In [None]:
import numpy as np

def convert_grayscale_to_rgb(images):
    return np.stack((images,) * 3, axis=-1)

features_by_category = {}
for category, images in preprocessed_images_by_category.items():
    # Check whether the images are grayscale (no trailing 3-channel axis)
    if images.shape[-1] != 3:
        images = convert_grayscale_to_rgb(images)
    features = model.predict(images)
    features_by_category[category] = features

for category, features in features_by_category.items():
    print(f"Category {category}, features shape: {features.shape}")


In [None]:
pca_by_category = {}
explained_variance_by_category = {}

for category, features in features_by_category.items():
    pca = PCA(n_components=0.95)  
    principal_components = pca.fit_transform(features)
    pca_by_category[category] = pca
    explained_variance_by_category[category] = pca.explained_variance_ratio_
    
    print(f"Category {category}, principal components: {principal_components.shape[1]}")

for category, pca in pca_by_category.items():
    print(f"Category {category}, principal components shape: {pca.components_.shape}")
    print(f"Category {category}, explained variance: {np.sum(explained_variance_by_category[category]) * 100:.2f}%")


### Testing phase


In [None]:
def load_and_preprocess_test_images(test_dir, categories, image_size, input_size):
    # input_size is kept in the signature for compatibility but is not used here
    test_images_by_category = load_images_by_category(test_dir, categories, image_size)
    test_centered_images_by_category = {}

    for category, images in test_images_by_category.items():
        # Apply the same per-image centering used for the training images
        test_centered_images_by_category[category] = center_images(images)

    return test_centered_images_by_category

image_size = (224, 224)

# Pass the unique category names, as done when loading the training images
test_preprocessed_images_by_category = load_and_preprocess_test_images('images_test', unique_categories, image_size, input_size=(224,224))


In [None]:
import numpy as np

def ensure_rgb_format(images):
    # Check whether the images have three dimensions (batch_size, height, width)
    if len(images.shape) == 3:  
        # Grayscale input: replicate the single channel to create 3-channel (RGB) images
        images = np.stack((images,) * 3, axis=-1)
    return images

def extract_features_with_vgg16(model, preprocessed_images_by_category):
    features_by_category = {}
    for category, images in preprocessed_images_by_category.items():
        # Make sure the images are in the expected RGB format
        images = ensure_rgb_format(images)
        
        # Run the model to obtain the features
        features = model.predict(images)
        features_by_category[category] = features
    return features_by_category

# Now extract the features using the model
test_features_by_category = extract_features_with_vgg16(model, test_preprocessed_images_by_category)


In [None]:
def centralize_features(features_by_category):
    centralized_features_by_category = {}
    for category, features in features_by_category.items():
        # Center the features by subtracting the mean
        mean_features = np.mean(features, axis=0)
        centralized_features = features - mean_features
        centralized_features_by_category[category] = centralized_features
        
        print(f"Category {category}: centralized features shape = {centralized_features.shape}")
        print(f"Category {category}: mean of centralized features = {np.mean(centralized_features, axis=0)}")  # Should be close to 0
    return centralized_features_by_category

centralized_test_features_by_category = centralize_features(test_features_by_category)

In [None]:
def calculate_reconstruction_error(test_features, pca_by_category):
    reconstruction_errors_by_category = {}
    mean_reconstruction_errors_by_category = {}
    
    for category, pca in pca_by_category.items():
        principal_components = pca.transform(test_features)
        reconstructed_features = pca.inverse_transform(principal_components)
        
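        # Per-sample residual norm, normalized by the norm of that same sample
        reconstruction_error = np.linalg.norm(test_features - reconstructed_features, axis=1)
        sample_norms = np.linalg.norm(test_features, axis=1)
        reconstruction_errors_by_category[category] = reconstruction_error / np.maximum(sample_norms, 1e-12)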

    for category, errors in reconstruction_errors_by_category.items():
        mean_reconstruction_errors_by_category[category] = np.mean(errors)
    
    best_category = min(mean_reconstruction_errors_by_category, key=mean_reconstruction_errors_by_category.get)

    for category in mean_reconstruction_errors_by_category:
        print(f"Category {category}, mean reconstruction error: {mean_reconstruction_errors_by_category[category]}")
    
    print(f"Best category: {best_category}")
    print("=====================================")

    return mean_reconstruction_errors_by_category, best_category

for category, test_features in centralized_test_features_by_category.items():
    print(f"Test category: {category}")
    mean_reconstruction_errors, best_category = calculate_reconstruction_error(test_features, pca_by_category)


## Agnostic Spaces Analysis

In [None]:
from sklearn.decomposition import PCA
import numpy as np

# Initialize the dictionary that stores the PCA results
pca_results = {}

# List of explained-variance percentages to compute
percentages = [95]
categories = ['Coast', 'Office']

# Loop over the different explained-variance percentages
for perc in percentages:
    # Initialize dictionaries to store the results
    pca_by_category = {}
    explained_variance_by_category = {}

    # Loop over the categories
    for category, features in features_by_category.items():
        # Initialize PCA with the specified percentage
        pca = PCA(n_components=perc / 100.0)
        principal_components = pca.fit_transform(features)
        
        # Store the PCA results for each category
        pca_by_category[category] = pca
        explained_variance_by_category[category] = pca.explained_variance_ratio_
        
        print(f"Category {category}, principal components: {principal_components.shape[1]}")

    # Store the results in the main pca_results dictionary
    pca_results[perc] = {}
    for category in categories:
        if category in pca_by_category:
            pca = pca_by_category[category]
            components = pca.components_
            explained_variance_ratio = pca.explained_variance_ratio_
            
            # Store the components, the explained variance, and the PCA object
            pca_results[perc][category] = {
                'components': components,
                'explained_variance_ratio': explained_variance_ratio,
                'pca_object': pca
            }
            
            print(f"Category {category}, principal components shape: {components.shape}")
            print(f"Category {category}, explained variance: {np.sum(explained_variance_ratio) * 100:.2f}%")
        else:
            print(f"Category '{category}' is not present in the data for {perc}%.")


In [None]:
def plot_cumulative_variance(explained_variance_ratio):
    """
    Plot the cumulative explained variance from the explained variance of each principal component.
    
    Parameters:
    - explained_variance_ratio: Array or list with the explained variance per component.
    """
    cumulative_variance = np.cumsum(explained_variance_ratio)  # Compute the cumulative variance

    plt.figure(figsize=(8, 5))
    plt.plot(np.arange(1, len(cumulative_variance) + 1), cumulative_variance, marker='o', linestyle='--', color='b')
    plt.title('Cumulative Explained Variance')
    plt.xlabel('Number of Components')
    plt.ylabel('Cumulative Explained Variance')
    plt.grid(True)
    plt.show()

In [None]:
# Check whether pca_results contains the explained variance
percentage_to_use = 95  # Choose the percentage to use
category_to_use = 'Coast'  # Choose the category to analyze

if percentage_to_use in pca_results:
    if category_to_use in pca_results[percentage_to_use]:
        explained_variance_ratio = pca_results[percentage_to_use][category_to_use].get('explained_variance_ratio', None)
        
        if explained_variance_ratio is not None:
            plot_cumulative_variance(explained_variance_ratio)
        else:
            print(f"Explained variance ratio not found for {category_to_use} at {percentage_to_use}%.")
    else:
        print(f"Category '{category_to_use}' not found for {percentage_to_use}%.")
else:
    print(f"Percentage '{percentage_to_use}%' not found in pca_results.")


In [None]:
# Check whether pca_results contains the explained variance
percentage_to_use = 95  # Choose the percentage to use
category_to_use = 'Office'  # Choose the category to analyze

if percentage_to_use in pca_results:
    if category_to_use in pca_results[percentage_to_use]:
        explained_variance_ratio = pca_results[percentage_to_use][category_to_use].get('explained_variance_ratio', None)
        
        if explained_variance_ratio is not None:
            plot_cumulative_variance(explained_variance_ratio)
        else:
            print(f"Explained variance ratio not found for {category_to_use} at {percentage_to_use}%.")
    else:
        print(f"Category '{category_to_use}' not found for {percentage_to_use}%.")
else:
    print(f"Percentage '{percentage_to_use}%' not found in pca_results.")


In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Project patches onto the components of a category
def project_test_patches(patches, pca_components):
    # Make sure patches has at least 2 dimensions
    if len(patches.shape) == 1:
        patches = np.expand_dims(patches, axis=0)  # Expand to (1, n_features)
    
    return np.dot(patches, pca_components.T)

# Compute norms, means, and means of the norms of the inner products on the same component
def calculate_norms_and_means(projections_A, projections_B):
    norms = []
    means = []
    means_norms = []

    # Check that projections_A and projections_B have at least two dimensions
    if len(projections_A.shape) < 2 or len(projections_B.shape) < 2:
        raise ValueError("Projections must have at least two dimensions (patches, components).")

    for i in range(projections_A.shape[1]):  # Iterate over the components
        dot_products = np.dot(projections_A[:, i], projections_B[:, i].T)
        norms.append(np.linalg.norm(dot_products))
        means.append(np.mean(dot_products))
        means_norms.append(np.mean(np.linalg.norm(dot_products)))
        
    return norms, means, means_norms

# Plot the mean norms for all images combined
def plot_mean_norms_for_all_images(category, other_category, mean_of_means_norms, color='blue'):
    plt.figure(figsize=(10, 6))
    
    # Plot the results for all components, combined over all images
    plt.bar(range(len(mean_of_means_norms)), mean_of_means_norms, color=color,
            label=f'{category} on {other_category} - All Images')
    
    plt.title(f'Mean of Norms for Components ({category} on {other_category}) - 95% Variance Explained')
    plt.xlabel('Component Index')
    plt.ylabel('Mean of Norms')
    plt.legend()
    plt.show()

# List of categories to iterate over
categories = ['Coast', 'Office']

# Work only with the PCA at 95% explained variance
perc = 95

# Iterate over the categories to compute both intra-category and cross-category results
for category in categories:
    for other_category in categories:
        # Load the components of the category itself or of the other category
        components = pca_results[perc][other_category]['components']
        
        if category == other_category:
            print(f"\nCategory: {category} on {category} (Intra-Category), Percentage: {perc}%")
            color = 'blue'
        else:
            print(f"\nCategory: {category} on {other_category} (Cross-Category), Percentage: {perc}%")
            color = 'green'
        
        # Store the mean norms for all images
        all_means_norms = []
        
        # Project the patches (intra- or cross-category) for all images
        # Iterate over each image in the feature array
        for image_idx in range(centralized_test_features_by_category[category].shape[0]):
            patches = centralized_test_features_by_category[category][image_idx]  # Select the patches of the image

            # Project the patches onto the components
            projected_patches = project_test_patches(patches, components)
            
            # Check the shape of projected_patches
            if len(projected_patches.shape) < 2 or projected_patches.shape[1] != components.shape[0]:
                raise ValueError(f"Projected patches have unexpected shape: {projected_patches.shape}. Expected at least 2 dimensions and components matching PCA.")
            
            # Compute norms, means, and means of the norms for each image
            _, _, means_norms_category = calculate_norms_and_means(projected_patches, projected_patches)
            
            # Store the norms computed for the image
            all_means_norms.append(means_norms_category)
        
        # Compute the mean of the norms over all images
        mean_of_means_norms = np.mean(all_means_norms, axis=0)  # Mean of the norms over all images
        
        # Plot the mean norm values for all components combined
        plot_mean_norms_for_all_images(category, other_category, mean_of_means_norms, color=color)


In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Project patches onto the components of a category
def project_test_patches(patches, pca_components):
    projected = np.dot(patches, pca_components.T)
    
    # If the result is one-dimensional, expand it to two dimensions
    if len(projected.shape) == 1:
        projected = np.expand_dims(projected, axis=0)
    
    return projected

# Compute norms, means, and means of the norms of the inner products on the same component
def calculate_norms_and_means(projections_A, projections_B):
    norms = []
    means = []
    means_norms = []

    # Make sure projections_A has at least two dimensions
    if len(projections_A.shape) < 2:
        raise ValueError("projections_A must have at least two dimensions (patches, components).")
    
    for i in range(projections_A.shape[1]):  # Iterate over the components
        dot_products = np.dot(projections_A[:, i], projections_B[:, i].T)
        norms.append(np.linalg.norm(dot_products))
        means.append(np.mean(dot_products))
        means_norms.append(np.mean(np.linalg.norm(dot_products)))
        
    return norms, means, means_norms

# Select the components that explain ~90% of the variance and whose norms are below the chosen percentile and an upper bound
def capture_components_by_percentile_and_threshold(explained_variance_ratio, means_norms, variance_threshold=0.9, exclude_first=True, norm_threshold=50):
    # Compute the cumulative explained variance
    cumulative_variance = np.cumsum(explained_variance_ratio)
    
    # Take the indices that explain up to ~90% of the variance
    selected_indices = np.where(cumulative_variance <= variance_threshold)[0]
    
    # Exclude the first component if requested
    if exclude_first and 0 in selected_indices:
        selected_indices = selected_indices[selected_indices != 0]
    
    # Make sure selected_indices is a list of integers
    selected_indices = list(map(int, selected_indices))
    
    # Nothing to select from (np.percentile would fail on an empty list)
    if not selected_indices:
        return []
    
    # Compute the chosen percentile (7th) of the means_norms
    percentile = np.percentile([means_norms[i] for i in selected_indices], 7)
    
    # Keep the components whose means_norms are below both the percentile and the upper bound
    selected_indices_filtered = [i for i in selected_indices if means_norms[i] <= percentile and means_norms[i] <= norm_threshold]
    
    return selected_indices_filtered

# List of categories to iterate over
categories = ['Coast', 'Office']

# Work only with the PCA at 95% explained variance
perc = 95

selected_indices_dict = {}

# Iterate over the categories to compute both intra-category and cross-category results
for category in categories:
    for other_category in categories:
        # Load the components and explained variance of the category itself or of the other category
        components = pca_results[perc][other_category]['components']
        explained_variance_ratio = pca_results[perc][other_category]['explained_variance_ratio']
        
        if category == other_category:
            print(f"\nCategory: {category} on {category} (Intra-Category), Percentage: {perc}%")
        else:
            print(f"\nCategory: {category} on {other_category} (Cross-Category), Percentage: {perc}%")
        
        # Project the patches (intra- or cross-category)
        all_means_norms = []
        all_selected_indices = []
        
        # Iterate over the images and patches
        for image_idx in range(centralized_test_features_by_category[category].shape[0]):
            patches = centralized_test_features_by_category[category][image_idx]  # Select the patches of the image

            # Project the patches onto the components
            projected_patches = project_test_patches(patches, components)
            
            # Compute norms, means, and means of the norms for each image
            norms_category, means_category, means_norms_category = calculate_norms_and_means(projected_patches, projected_patches)
            
            # Select the components whose mean norms are below the percentile and that explain up to ~90% of the variance
            selected_indices = capture_components_by_percentile_and_threshold(explained_variance_ratio, means_norms_category, exclude_first=True, norm_threshold=50)
            
            # Store the mean norms and the selected components
            all_means_norms.append(means_norms_category)
            all_selected_indices.append(selected_indices)
        
        # Check whether any components were selected
        if len(all_selected_indices) == 0 or np.concatenate(all_selected_indices).size == 0:
            print(f"Warning: No components selected for {category} on {other_category}. Skipping this combination.")
            continue

        # Aggregate the selected components over all images
        aggregated_selected_indices = np.unique(np.concatenate(all_selected_indices)).astype(int)  # Convert to integers
        
        # Initialize the dictionary entry if the key does not exist
        if category not in selected_indices_dict:
            selected_indices_dict[category] = {}
        
        selected_indices_dict[category][other_category] = aggregated_selected_indices

        # Skip plotting if no components were selected
        if len(aggregated_selected_indices) == 0:
            print(f"Warning: No valid components selected for {category} on {other_category}. Skipping plot.")
            continue

        # Plot the results for the selected components
        plt.figure(figsize=(10, 6))
        plt.bar(aggregated_selected_indices, [np.mean([means_norms[int(i)] for means_norms in all_means_norms if int(i) < len(means_norms)]) for i in aggregated_selected_indices], 
                color='green' if category != other_category else 'blue',
                label=f'{category} on {other_category} - Selected Components')
        plt.title(f'Selected Components Based on ~90% Variance and Below 7th Percentile of Mean Norms ({category} on {other_category}) - 95% Variance Explained')
        plt.xlabel('Component Index')
        plt.ylabel('Mean of Norms')
        plt.legend()
        plt.show()


In [None]:
centered_test_office_patches = centralized_test_features_by_category['Office']
centered_test_coast_patches = centralized_test_features_by_category['Coast']

In [None]:
import os
import numpy as np

def project_and_transform_back(features, pca, specific_indices):
    """
    Project the features onto the specified principal components and reconstruct them from those components.
    """
    # Project the features onto the principal components
    projected = pca.transform(features)
    
    # Keep only the specified components
    projected_specific = projected[:, specific_indices]
    
    # Reconstruct the features using only the specified components
    specific_components = pca.components_[specific_indices]
    reconstructed_features = np.dot(projected_specific, specific_components)
    
    return reconstructed_features

def calculate_mean_ood_for_specific_components(original_features, pca, specific_indices):
    """
    Project the original features onto specific PCA components, reconstruct them, and compute the mean of the OOD scores.
    """
    total_ood_scores = []
    
    # Iterate over all feature samples
    for sample_idx in range(original_features.shape[0]):
        features = original_features[sample_idx]
        
        # Project and reconstruct the features using the specified components
        reconstructed_features = project_and_transform_back(features.reshape(1, -1), pca, specific_indices)
        
        # Compute the residuals (reconstruction error)
        residuals = features - reconstructed_features.flatten()
        
        # Compute the norms of the original features and of the residuals
        original_norm = np.linalg.norm(features)
        residual_norm = np.linalg.norm(residuals)
        
        # Warn if the residual norm is larger than the norm of the original features
        if residual_norm > original_norm:
            print(f"Warning: Residual norm ({residual_norm}) greater than original norm ({original_norm}) for sample {sample_idx}")
        
        # Compute the OOD score (residual norm over the norm of the original features)
        if original_norm == 0:
            ood_score = 0
        else:
            ood_score = residual_norm / original_norm
        
        # Append this sample's OOD score to the overall list
        total_ood_scores.append(ood_score)
    
    # Return the mean of the OOD scores
    return np.mean(total_ood_scores)

# Iterate over the categories to compute the mean OOD scores
mean_ood_scores = {}

for category in categories:
    for other_category in categories:
        specific_indices = selected_indices_dict[category][other_category]
        
        # Retrieve the PCA object for the corresponding category
        pca_object = pca_results[perc][other_category]['pca_object']  # We use the components of other_category
        
        # Check whether test features exist for the category
        if category not in centralized_test_features_by_category:
            print(f"Warning: No test features found for {category}. Skipping.")
            continue
        
        # Compute the mean OOD score from the projection onto the specified components
        mean_ood = calculate_mean_ood_for_specific_components(centralized_test_features_by_category[category], pca_object, specific_indices)
        
        # Store the mean in the dictionary
        mean_ood_scores[f"{category}_on_{other_category}"] = mean_ood

# Print all the computed means
for key, mean_ood in mean_ood_scores.items():
    print(f"Mean OOD Score for {key}: {mean_ood}")


## ✌️ Part II: Comparing two similar environments

In [None]:
train_categories = ['Bedroom', 'LivingRoom']

df_different = df[df['category'].isin(train_categories)]
df_different

In [None]:
X = df_different['image_path']
y = df_different['category']
(X_train, X_test, y_train, y_test) = train_test_split(X, y, test_size=0.2, random_state=10)

image_size = (224, 224)
unique_categories = list(df_different['category'].unique())
print(f"Unique categories: {unique_categories}")


In [None]:
create_images_set(X_train, X_test, y_train, y_test, output_dir_train='images_train', output_dir_test='images_test', standard_size=image_size)

In [None]:
training_images_by_category = load_images_by_category('images_train', unique_categories, image_size=(224, 224))


In [None]:
def center_images(images):
    # Compute each image's mean over its pixel axes
    # Check if images have 3 or 4 dimensions
    if len(images.shape) == 3:
        num_images, height, width = images.shape
        # For grayscale images, no need for the 'channels' dimension
        mean_image = np.mean(images, axis=(1, 2), keepdims=True)
    elif len(images.shape) == 4:
        num_images, height, width, channels = images.shape
        mean_image = np.mean(images, axis=(1, 2, 3), keepdims=True)
    else:
        raise ValueError("Unexpected image shape")

    # Subtract the mean from each image
    centered_images = images - mean_image
    
    return centered_images

centered_images_by_category = {}
for category, images in training_images_by_category.items():
    print(images.shape)
    centered_images = center_images(images)
    centered_images_by_category[category] = centered_images
    print(f"Category {category}, images shape: {centered_images.shape}")
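
Note that `center_images` subtracts one scalar mean per image (taken over that image's own pixels), not a dataset-wide mean image. A minimal sketch on a toy array shows the effect; the values and the names `toy_images` and `toy_centered` are made up for illustration and are not part of the notebook's data:

```python
import numpy as np

# Two hypothetical 2x2 grayscale "images" (made-up values).
toy_images = np.array([
    [[0.0, 2.0], [4.0, 6.0]],      # pixel mean 3.0
    [[10.0, 10.0], [20.0, 20.0]],  # pixel mean 15.0
])

# Same operation as the 3-D branch of center_images: one scalar mean per image.
per_image_mean = np.mean(toy_images, axis=(1, 2), keepdims=True)
toy_centered = toy_images - per_image_mean

print(per_image_mean.ravel())          # one mean per image
print(toy_centered.mean(axis=(1, 2)))  # each centered image has zero mean
```

Because the mean is computed per image, two images that differ only by a constant brightness offset become identical after this step.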


In [None]:
import numpy as np

def convert_grayscale_to_rgb(images):
    return np.stack((images,) * 3, axis=-1)

preprocessed_images_by_category = centered_images_by_category 
features_by_category = {}
for category, images in preprocessed_images_by_category.items():
    # Check whether the images are grayscale (no trailing 3-channel axis)
    if images.shape[-1] != 3:
        images = convert_grayscale_to_rgb(images)
    features = model.predict(images)
    features_by_category[category] = features


for category, features in features_by_category.items():
    print(f"Category {category}, features shape: {features.shape}")


In [None]:
pca_by_category = {}
explained_variance_by_category = {}

for category, features in features_by_category.items():
    pca = PCA(n_components=0.95)  
    principal_components = pca.fit_transform(features)
    pca_by_category[category] = pca
    explained_variance_by_category[category] = pca.explained_variance_ratio_
    
    print(f"Category {category}, principal components: {principal_components.shape[1]}")

for category, pca in pca_by_category.items():
    print(f"Category {category}, principal components shape: {pca.components_.shape}")
    print(f"Category {category}, explained variance: {np.sum(explained_variance_by_category[category]) * 100:.2f}%")
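
Passing a float between 0 and 1 as `n_components` asks scikit-learn to keep the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below illustrates this on synthetic data; `toy_features` is a hypothetical stand-in for the deep features, with most of its variance placed in five directions on purpose:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical features: 200 samples, 50 dimensions, dominated by 5 latent directions.
latent = rng.normal(size=(200, 5)) * 10.0
mixing = rng.normal(size=(5, 50))
toy_features = latent @ mixing + 0.1 * rng.normal(size=(200, 50))

# Keep as many components as needed to explain at least 95% of the variance.
toy_pca = PCA(n_components=0.95).fit(toy_features)
kept = toy_pca.n_components_
explained = toy_pca.explained_variance_ratio_.sum()

print(f"components kept: {kept}, variance explained: {explained:.4f}")
```

The number of components kept is therefore data-dependent, which is why each category in the notebook can end up with a different subspace dimension.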


# Test

In [None]:
def load_and_preprocess_test_images(test_dir, categories, image_size, input_size):
    test_images_by_category = load_images_by_category(test_dir, categories, image_size)
    test_centered_images_by_category = {}
    test_scalers_by_category = {}

    for category, images in test_images_by_category.items():
        test_centered_images = center_images(images)
        test_centered_images_by_category[category] = test_centered_images

    return test_centered_images_by_category

image_size = (224, 224)

# Pass the list of unique category names, not the per-image label column y
test_preprocessed_images_by_category = load_and_preprocess_test_images('images_test', unique_categories, image_size, input_size=(224,224))


In [None]:
import numpy as np

def ensure_rgb_format(images):
    # Check whether the images have three dimensions (batch_size, height, width)
    if len(images.shape) == 3:
        # Grayscale input: replicate the single channel to build 3-channel (RGB) images
        images = np.stack((images,) * 3, axis=-1)
    return images

def extract_features_with_vgg16(model, preprocessed_images_by_category):
    features_by_category = {}
    for category, images in preprocessed_images_by_category.items():
        # Make sure the images are in the expected RGB format
        images = ensure_rgb_format(images)

        # Run the model to obtain the features
        features = model.predict(images)
        features_by_category[category] = features
    return features_by_category

# Extract the test features with the model
test_features_by_category = extract_features_with_vgg16(model, test_preprocessed_images_by_category)


In [None]:
def centralize_features(features_by_category):
    centralized_features_by_category = {}
    for category, features in features_by_category.items():
        # Center the features by subtracting the mean
        mean_features = np.mean(features, axis=0)
        centralized_features = features - mean_features
        centralized_features_by_category[category] = centralized_features

        print(f"Category {category}: centralized features shape = {centralized_features.shape}")
        print(f"Category {category}: mean of centralized features = {np.mean(centralized_features, axis=0)}")  # Should be close to 0
    return centralized_features_by_category

centralized_test_features_by_category = centralize_features(test_features_by_category)

In [None]:
def calculate_reconstruction_error(test_features, pca_by_category):
    reconstruction_errors_by_category = {}
    mean_reconstruction_errors_by_category = {}
    
    for category, pca in pca_by_category.items():
        principal_components = pca.transform(test_features)
        reconstructed_features = pca.inverse_transform(principal_components)
        
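        reconstruction_error = np.linalg.norm(test_features - reconstructed_features, axis=1)
        # Normalize each sample's error by that sample's own norm (not by the norm of the whole matrix)
        reconstruction_errors_by_category[category] = reconstruction_error / np.linalg.norm(test_features, axis=1)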

    for category, errors in reconstruction_errors_by_category.items():
        mean_reconstruction_errors_by_category[category] = np.mean(errors)
    
    best_category = min(mean_reconstruction_errors_by_category, key=mean_reconstruction_errors_by_category.get)

    for category in mean_reconstruction_errors_by_category:
        print(f"Category {category}, mean reconstruction error: {mean_reconstruction_errors_by_category[category]}")
    
    print(f"Best category: {best_category}")
    print("=====================================")

    return mean_reconstruction_errors_by_category, best_category

for category, test_features in centralized_test_features_by_category.items():
    print(f"Test category: {category}")
    mean_reconstruction_errors, best_category = calculate_reconstruction_error(test_features, pca_by_category)
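
The idea behind the reconstruction-error test can be checked on synthetic data: samples drawn from the subspace the PCA was fitted on should reconstruct well, while samples from elsewhere should not. The following is a minimal sketch under that assumption; the data, the helper `relative_reconstruction_error`, and all variable names are hypothetical and only illustrate the per-sample relative error, not the notebook's actual features:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
basis = rng.normal(size=(3, 20))  # hypothetical 3-D subspace of a 20-D feature space

# In-distribution data lives (almost) on the subspace; OOD data is isotropic noise.
in_dist_train = rng.normal(size=(300, 3)) @ basis + 0.05 * rng.normal(size=(300, 20))
in_dist_test = rng.normal(size=(50, 3)) @ basis + 0.05 * rng.normal(size=(50, 20))
ood_test = rng.normal(size=(50, 20))

toy_pca = PCA(n_components=0.95).fit(in_dist_train)

def relative_reconstruction_error(x, pca):
    """Per-sample residual norm divided by the norm of the centered sample."""
    reconstructed = pca.inverse_transform(pca.transform(x))
    residual_norm = np.linalg.norm(x - reconstructed, axis=1)
    sample_norm = np.linalg.norm(x - pca.mean_, axis=1)
    return residual_norm / sample_norm

in_score = relative_reconstruction_error(in_dist_test, toy_pca).mean()
ood_score = relative_reconstruction_error(ood_test, toy_pca).mean()
print(f"in-distribution: {in_score:.3f}, out-of-distribution: {ood_score:.3f}")
```

In this toy setting the out-of-distribution mean is the larger of the two, which is the behavior the "best category" selection above relies on.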


## Agnostic Spaces

In [None]:

from sklearn.decomposition import PCA
import numpy as np

# Dictionary that stores the PCA results
pca_results = {}

# Explained-variance percentages to compute
percentages = [95]
categories = ['Bedroom', 'LivingRoom']

# Loop over the explained-variance percentages
for perc in percentages:
    # Dictionaries that store the per-category results
    pca_by_category = {}
    explained_variance_by_category = {}

    # Loop over the categories
    for category, features in features_by_category.items():
        # Fit a PCA that keeps the requested fraction of variance
        pca = PCA(n_components=perc / 100.0)
        principal_components = pca.fit_transform(features)

        # Store the PCA results for each category
        pca_by_category[category] = pca
        explained_variance_by_category[category] = pca.explained_variance_ratio_

        print(f"Category {category}, principal components: {principal_components.shape[1]}")

    # Store the results in the main pca_results dictionary
    pca_results[perc] = {}
    for category in categories:
        if category in pca_by_category:
            pca = pca_by_category[category]
            components = pca.components_
            explained_variance_ratio = pca.explained_variance_ratio_

            # Store the components, the explained variance and the PCA object
            pca_results[perc][category] = {
                'components': components,
                'explained_variance_ratio': explained_variance_ratio,
                'pca_object': pca
            }

            print(f"Category {category}, principal components shape: {components.shape}")
            print(f"Category {category}, explained variance: {np.sum(explained_variance_ratio) * 100:.2f}%")
        else:
            print(f"Category '{category}' is not present in the data for {perc}%.")


In [None]:
# Check that pca_results holds the explained variance
percentage_to_use = 95  # Percentage to use
category_to_use = 'Bedroom'  # Category to analyze

if percentage_to_use in pca_results:
    if category_to_use in pca_results[percentage_to_use]:
        explained_variance_ratio = pca_results[percentage_to_use][category_to_use].get('explained_variance_ratio', None)
        
        if explained_variance_ratio is not None:
            plot_cumulative_variance(explained_variance_ratio)
        else:
            print(f"Explained variance ratio not found for {category_to_use} at {percentage_to_use}%.")
    else:
        print(f"Category '{category_to_use}' not found for {percentage_to_use}%.")
else:
    print(f"Percentage '{percentage_to_use}%' not found in pca_results.")


In [None]:
# Check that pca_results holds the explained variance
percentage_to_use = 95  # Percentage to use
category_to_use = 'LivingRoom'  # Category to analyze

if percentage_to_use in pca_results:
    if category_to_use in pca_results[percentage_to_use]:
        explained_variance_ratio = pca_results[percentage_to_use][category_to_use].get('explained_variance_ratio', None)
        
        if explained_variance_ratio is not None:
            plot_cumulative_variance(explained_variance_ratio)
        else:
            print(f"Explained variance ratio not found for {category_to_use} at {percentage_to_use}%.")
    else:
        print(f"Category '{category_to_use}' not found for {percentage_to_use}%.")
else:
    print(f"Percentage '{percentage_to_use}%' not found in pca_results.")


In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Project patches onto the components of a category
def project_test_patches(patches, pca_components):
    # Make sure patches has at least 2 dimensions
    if len(patches.shape) == 1:
        patches = np.expand_dims(patches, axis=0)  # Expand to (1, n_features)
    
    return np.dot(patches, pca_components.T)

# Compute norms, means and mean norms of the inner products on the same component
def calculate_norms_and_means(projections_A, projections_B):
    norms = []
    means = []
    means_norms = []

    # Check that projections_A and projections_B have at least two dimensions
    if len(projections_A.shape) < 2 or len(projections_B.shape) < 2:
        raise ValueError("Projections must have at least two dimensions (patches, components).")

    for i in range(projections_A.shape[1]):  # Iterate over the components
        dot_products = np.dot(projections_A[:, i], projections_B[:, i].T)
        norms.append(np.linalg.norm(dot_products))
        means.append(np.mean(dot_products))
        means_norms.append(np.mean(np.linalg.norm(dot_products)))
        
    return norms, means, means_norms

# Plot the mean norms for all images combined
def plot_mean_norms_for_all_images(category, other_category, mean_of_means_norms, color='blue'):
    plt.figure(figsize=(10, 6))

    # Plot the per-component results averaged over all images
    plt.bar(range(len(mean_of_means_norms)), mean_of_means_norms, color=color,
            label=f'{category} on {other_category} - All Images')
    
    plt.title(f'Mean of Norms for Components ({category} on {other_category}) - 95% Variance Explained')
    plt.xlabel('Component Index')
    plt.ylabel('Mean of Norms')
    plt.legend()
    plt.show()

# Categories to iterate over
categories = ['Bedroom', 'LivingRoom']

# Work only with the PCA that explains 95% of the variance
perc = 95

# Iterate over the categories to compute both intra-category and cross-category results
for category in categories:
    for other_category in categories:
        # Load the components of other_category (the category itself in the intra-category case)
        components = pca_results[perc][other_category]['components']
        
        if category == other_category:
            print(f"\nCategory: {category} on {category} (Intra-Category), Percentage: {perc}%")
            color = 'blue'
        else:
            print(f"\nCategory: {category} on {other_category} (Cross-Category), Percentage: {perc}%")
            color = 'green'
        
        # Store the mean norms for all images
        all_means_norms = []

        # Project the patches (intra- or cross-category) for all images
        # Iterate over each image in the feature array
        for image_idx in range(centralized_test_features_by_category[category].shape[0]):
            patches = centralized_test_features_by_category[category][image_idx]  # Select this image's patches

            # Project the patches onto the components
            projected_patches = project_test_patches(patches, components)

            # Check the shape of projected_patches
            if len(projected_patches.shape) < 2 or projected_patches.shape[1] != components.shape[0]:
                raise ValueError(f"Projected patches have unexpected shape: {projected_patches.shape}. Expected at least 2 dimensions and components matching PCA.")

            # Compute norms, means and mean norms for each image
            _, _, means_norms_category = calculate_norms_and_means(projected_patches, projected_patches)

            # Store the norms computed for this image
            all_means_norms.append(means_norms_category)

        # Average the norms over all images
        mean_of_means_norms = np.mean(all_means_norms, axis=0)  # Mean of the norms across all images

        # Plot the mean norm values for all components
        plot_mean_norms_for_all_images(category, other_category, mean_of_means_norms, color=color)
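
One detail of `calculate_norms_and_means` is worth spelling out: when both arguments are the same projection matrix, the dot product of a component's column with itself is a single scalar, namely the sum of squared projections on that component. Taking the norm and then the mean of that scalar leaves it unchanged. The sketch below checks this on a small made-up matrix (`projections` is hypothetical, 3 patches by 2 components):

```python
import numpy as np

# Hypothetical projections: 3 patches (rows) on 2 components (columns).
projections = np.array([[1.0, 0.0],
                        [2.0, -1.0],
                        [2.0, 3.0]])

column = projections[:, 0]
dot_product = np.dot(column, column.T)           # scalar: 1 + 4 + 4
energy = np.sum(column ** 2)                     # sum of squared projections
same = np.mean(np.linalg.norm(dot_product))      # norm and mean of a non-negative scalar

# Vectorized equivalent for all components at once.
per_component_energy = np.sum(projections ** 2, axis=0)
print(dot_product, energy, same, per_component_energy)
```

Under this reading, the bars plotted above show, per component, the squared projection energy averaged over the test images.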


In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Project patches onto the components of a category
def project_test_patches(patches, pca_components):
    projected = np.dot(patches, pca_components.T)

    # If the result is one-dimensional, expand it to two dimensions
    if len(projected.shape) == 1:
        projected = np.expand_dims(projected, axis=0)
    
    return projected

# Compute norms, means and mean norms of the inner products on the same component
def calculate_norms_and_means(projections_A, projections_B):
    norms = []
    means = []
    means_norms = []

    # Make sure projections_A has at least two dimensions
    if len(projections_A.shape) < 2:
        raise ValueError("projections_A must have at least two dimensions (patches, components).")

    for i in range(projections_A.shape[1]):  # Iterate over the components
        dot_products = np.dot(projections_A[:, i], projections_B[:, i].T)
        norms.append(np.linalg.norm(dot_products))
        means.append(np.mean(dot_products))
        means_norms.append(np.mean(np.linalg.norm(dot_products)))
        
    return norms, means, means_norms

# Select components that explain up to ~90% of the variance and whose mean norms fall below the chosen percentile and an upper bound
def capture_components_by_percentile_and_threshold(explained_variance_ratio, means_norms, variance_threshold=0.9, exclude_first=True, norm_threshold=50):
    # Cumulative explained variance
    cumulative_variance = np.cumsum(explained_variance_ratio)

    # Indices of the components that together explain up to ~90% of the variance
    selected_indices = np.where(cumulative_variance <= variance_threshold)[0]

    # Optionally drop the first component
    if exclude_first and 0 in selected_indices:
        selected_indices = selected_indices[selected_indices != 0]

    # Make sure selected_indices is a list of plain integers
    selected_indices = list(map(int, selected_indices))

    # No candidates left: a percentile of an empty list is not meaningful
    if not selected_indices:
        return []

    # 7th percentile of the mean norms of the candidate components
    percentile = np.percentile([means_norms[i] for i in selected_indices], 7)

    # Keep the components whose mean norm is below both the percentile and the upper bound
    selected_indices_filtered = [i for i in selected_indices if means_norms[i] <= percentile and means_norms[i] <= norm_threshold]

    return selected_indices_filtered

# Categories to iterate over
categories = ['Bedroom', 'LivingRoom']

# Work only with the PCA that explains 95% of the variance
perc = 95

selected_indices_dict = {}

# Iterate over the categories to compute both intra-category and cross-category results
for category in categories:
    for other_category in categories:
        # Load the components and explained variance of other_category (the category itself in the intra-category case)
        components = pca_results[perc][other_category]['components']
        explained_variance_ratio = pca_results[perc][other_category]['explained_variance_ratio']
        
        if category == other_category:
            print(f"\nCategory: {category} on {category} (Intra-Category), Percentage: {perc}%")
        else:
            print(f"\nCategory: {category} on {other_category} (Cross-Category), Percentage: {perc}%")
        
        # Project the patches (intra- or cross-category)
        all_means_norms = []
        all_selected_indices = []

        # Iterate over the images and their patches
        for image_idx in range(centralized_test_features_by_category[category].shape[0]):
            patches = centralized_test_features_by_category[category][image_idx]  # Select this image's patches

            # Project the patches onto the components
            projected_patches = project_test_patches(patches, components)

            # Compute norms, means and mean norms for each image
            norms_category, means_category, means_norms_category = calculate_norms_and_means(projected_patches, projected_patches)

            # Select the components whose mean norms are below the percentile and that explain up to ~90% of the variance
            selected_indices = capture_components_by_percentile_and_threshold(explained_variance_ratio, means_norms_category, exclude_first=True, norm_threshold=50)

            # Store the mean norms and the selected components
            all_means_norms.append(means_norms_category)
            all_selected_indices.append(selected_indices)

        # Check whether any components were selected
        if len(all_selected_indices) == 0 or np.concatenate(all_selected_indices).size == 0:
            print(f"Warning: No components selected for {category} on {other_category}. Skipping this combination.")
            continue

        # Aggregate the components selected across all images
        aggregated_selected_indices = np.unique(np.concatenate(all_selected_indices)).astype(int)  # Cast to integers

        # Initialize the nested dictionary if the key does not exist yet
        if category not in selected_indices_dict:
            selected_indices_dict[category] = {}

        selected_indices_dict[category][other_category] = aggregated_selected_indices

        # Skip plotting if no components were selected
        if len(aggregated_selected_indices) == 0:
            print(f"Warning: No valid components selected for {category} on {other_category}. Skipping plot.")
            continue

        # Plot the results for the selected components
        plt.figure(figsize=(10, 6))
        plt.bar(aggregated_selected_indices, [np.mean([means_norms[int(i)] for means_norms in all_means_norms if int(i) < len(means_norms)]) for i in aggregated_selected_indices], 
                color='green' if category != other_category else 'blue',
                label=f'{category} on {other_category} - Selected Components')
        plt.title(f'Selected Components Based on ~90% Variance and Below 7th Percentile of Mean Norms ({category} on {other_category}) - 95% Variance Explained')
        plt.xlabel('Component Index')
        plt.ylabel('Mean of Norms')
        plt.legend()
        plt.show()
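
To see how strict the selection rule is, it helps to trace it on a handful of made-up numbers. The sketch below re-implements the three steps (cumulative variance up to 90%, dropping the first component, 7th-percentile filter) in a compact helper; `select_components`, `toy_ratio`, and `toy_means_norms` are hypothetical names and values used only for illustration:

```python
import numpy as np

def select_components(explained_variance_ratio, means_norms,
                      variance_threshold=0.9, percentile_q=7, norm_threshold=50):
    """Compact re-statement of the selection rule, for illustration only."""
    cumulative = np.cumsum(explained_variance_ratio)
    # Components whose cumulative variance stays within the threshold, minus the first one.
    candidates = [int(i) for i in np.where(cumulative <= variance_threshold)[0] if i != 0]
    if not candidates:
        return []
    cutoff = np.percentile([means_norms[i] for i in candidates], percentile_q)
    return [i for i in candidates
            if means_norms[i] <= cutoff and means_norms[i] <= norm_threshold]

toy_ratio = [0.5, 0.2, 0.15, 0.1, 0.05]      # cumulative: 0.5, 0.7, 0.85, 0.95, 1.0
toy_means_norms = [9.0, 4.0, 1.0, 0.5, 0.2]

selected = select_components(toy_ratio, toy_means_norms)
print(selected)  # [2]
```

Here the candidates are components 1 and 2, the 7th percentile of their mean norms (4.0 and 1.0) is 1.21, and only component 2 passes. With so low a percentile, each image tends to contribute only its weakest candidate components, so the aggregated set comes mostly from the union across images.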


In [None]:
import os
import numpy as np

def project_and_transform_back(features, pca, specific_indices):
    """
    Project the features onto specific principal components and reconstruct them from those components only.
    """
    # Rows of pca.components_ for the selected components
    specific_components = pca.components_[specific_indices]

    # The features passed in are already mean-centered, so project them directly;
    # pca.transform would subtract pca.mean_ again, and the reconstruction below never adds it back
    projected_specific = np.dot(features, specific_components.T)

    # Reconstruct the features using only the specific components
    reconstructed_features = np.dot(projected_specific, specific_components)

    return reconstructed_features

def calculate_mean_ood_for_specific_components(original_features, pca, specific_indices):
    """
    Project the original features onto specific PCA components, reconstruct them, and compute the mean of the OOD scores.
    """
    total_ood_scores = []
    
    # Iterate over all feature samples
    for sample_idx in range(original_features.shape[0]):
        features = original_features[sample_idx]

        # Project the features onto the specific components and reconstruct them
        reconstructed_features = project_and_transform_back(features.reshape(1, -1), pca, specific_indices)

        # Compute the residuals (reconstruction error)
        residuals = features - reconstructed_features.flatten()

        # Compute the norms of the original features and of the residuals
        original_norm = np.linalg.norm(features)
        residual_norm = np.linalg.norm(residuals)

        # Warn if the residual norm exceeds the norm of the original features
        if residual_norm > original_norm:
            print(f"Warning: Residual norm ({residual_norm}) greater than original norm ({original_norm}) for sample {sample_idx}")

        # OOD score: residual norm divided by the norm of the original features
        if original_norm == 0:
            ood_score = 0
        else:
            ood_score = residual_norm / original_norm

        # Append this sample's OOD score to the running list
        total_ood_scores.append(ood_score)

    # Return the mean of the OOD scores
    return np.mean(total_ood_scores)

# Iterate over the categories to compute the mean OOD scores
mean_ood_scores = {}

for category in categories:
    for other_category in categories:
        # Skip combinations for which the previous cell selected no components
        specific_indices = selected_indices_dict.get(category, {}).get(other_category)
        if specific_indices is None:
            print(f"Warning: No components selected for {category} on {other_category}. Skipping.")
            continue

        # Retrieve the PCA object fitted on other_category (its components are the ones used)
        pca_object = pca_results[perc][other_category]['pca_object']

        # Check that test features exist for this category
        if category not in centralized_test_features_by_category:
            print(f"Warning: No test features found for {category}. Skipping.")
            continue

        # Mean OOD score based on the projection onto the specific components
        mean_ood = calculate_mean_ood_for_specific_components(centralized_test_features_by_category[category], pca_object, specific_indices)

        # Store the mean in the dictionary
        mean_ood_scores[f"{category}_on_{other_category}"] = mean_ood

# Print all computed means
for key, mean_ood in mean_ood_scores.items():
    print(f"Mean OOD Score for {key}: {mean_ood}")
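
For features that are already centered, the score above can be read as the fraction of a sample's norm that lies outside the span of the selected components. The sketch below illustrates the two extremes with a hypothetical orthonormal basis built by QR decomposition; `subset_ood_score`, `toy_components`, and the chosen indices are assumptions for illustration, not the notebook's fitted PCA:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical orthonormal "components" (rows), mimicking the layout of pca.components_.
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
toy_components = q.T[:4]        # pretend the PCA kept 4 components of an 8-D space
specific_indices = [1, 3]       # pretend these were the selected components

def subset_ood_score(x, components, indices):
    """Residual norm over original norm after projecting onto the chosen components."""
    basis = components[indices]                 # shape (k, n_features)
    reconstructed = (x @ basis.T) @ basis       # orthogonal projection onto their span
    norm = np.linalg.norm(x)
    return 0.0 if norm == 0 else np.linalg.norm(x - reconstructed) / norm

inside = 2.0 * toy_components[1] - 0.5 * toy_components[3]   # lies in the selected span
outside = toy_components[0]                                  # orthogonal to the selected span

score_inside = subset_ood_score(inside, toy_components, specific_indices)
score_outside = subset_ood_score(outside, toy_components, specific_indices)
print(f"inside span: {score_inside:.3f}, outside span: {score_outside:.3f}")
```

Because the projection is orthogonal, the score stays between 0 (fully explained by the selected components) and 1 (entirely outside their span), so a residual norm larger than the original norm should not occur in this formulation.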


## All environments

In [None]:
X = df['image_path'].tolist()
y = df['category'].tolist()
unique_categories = list(df['category'].unique())
print(f"Unique categories: {unique_categories}")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

standard_size = (224, 224)

In [None]:
create_images_set(X_train, X_test, y_train, y_test, output_dir_train='images_train', output_dir_test='images_test', standard_size=standard_size)

In [None]:
training_images_by_category = load_images_by_category('images_train', unique_categories, image_size=(224, 224))


In [None]:
def center_images(images):
    # Compute each image's mean over its pixel axes
    # Check if images have 3 or 4 dimensions
    if len(images.shape) == 3:
        num_images, height, width = images.shape
        # For grayscale images, no need for the 'channels' dimension
        mean_image = np.mean(images, axis=(1, 2), keepdims=True)
    elif len(images.shape) == 4:
        num_images, height, width, channels = images.shape
        mean_image = np.mean(images, axis=(1, 2, 3), keepdims=True)
    else:
        raise ValueError("Unexpected image shape")

    # Subtract the mean from each image
    centered_images = images - mean_image
    
    return centered_images

centered_images_by_category = {}
for category, images in training_images_by_category.items():
    print(images.shape)
    centered_images = center_images(images)
    centered_images_by_category[category] = centered_images
    print(f"Category {category}, images shape: {centered_images.shape}")


In [None]:
import numpy as np

def convert_grayscale_to_rgb(images):
    return np.stack((images,) * 3, axis=-1)

preprocessed_images_by_category = centered_images_by_category 
features_by_category = {}
for category, images in preprocessed_images_by_category.items():
    # Check whether the images are grayscale (no trailing 3-channel axis)
    if images.shape[-1] != 3:
        images = convert_grayscale_to_rgb(images)
    features = model.predict(images)
    features_by_category[category] = features


for category, features in features_by_category.items():
    print(f"Category {category}, features shape: {features.shape}")


In [None]:
pca_by_category = {}
explained_variance_by_category = {}

for category, features in features_by_category.items():
    pca = PCA(n_components=0.95)  
    principal_components = pca.fit_transform(features)
    pca_by_category[category] = pca
    explained_variance_by_category[category] = pca.explained_variance_ratio_
    
    print(f"Category {category}, principal components: {principal_components.shape[1]}")

for category, pca in pca_by_category.items():
    print(f"Category {category}, principal components shape: {pca.components_.shape}")
    print(f"Category {category}, explained variance: {np.sum(explained_variance_by_category[category]) * 100:.2f}%")


# Test

In [None]:
def load_and_preprocess_test_images(test_dir, categories, image_size, input_size):
    test_images_by_category = load_images_by_category(test_dir, categories, image_size)
    test_centered_images_by_category = {}
    test_scalers_by_category = {}

    for category, images in test_images_by_category.items():
        test_centered_images = center_images(images)
        test_centered_images_by_category[category] = test_centered_images

    return test_centered_images_by_category

image_size = (224, 224)

# Pass the list of unique category names, not the per-image label list y
test_preprocessed_images_by_category = load_and_preprocess_test_images('images_test', unique_categories, image_size, input_size=(224,224))


In [None]:
import numpy as np

def ensure_rgb_format(images):
    # Check whether the images have three dimensions (batch_size, height, width)
    if len(images.shape) == 3:
        # Grayscale input: replicate the single channel to build 3-channel (RGB) images
        images = np.stack((images,) * 3, axis=-1)
    return images

def extract_features_with_vgg16(model, preprocessed_images_by_category):
    features_by_category = {}
    for category, images in preprocessed_images_by_category.items():
        # Make sure the images are in the expected RGB format
        images = ensure_rgb_format(images)

        # Run the model to obtain the features
        features = model.predict(images)
        features_by_category[category] = features
    return features_by_category

# Extract the test features with the model
test_features_by_category = extract_features_with_vgg16(model, test_preprocessed_images_by_category)


In [None]:
def centralize_features(features_by_category):
    centralized_features_by_category = {}
    for category, features in features_by_category.items():
        # Center the features by subtracting the mean
        mean_features = np.mean(features, axis=0)
        centralized_features = features - mean_features
        centralized_features_by_category[category] = centralized_features

        print(f"Category {category}: centralized features shape = {centralized_features.shape}")
        print(f"Category {category}: mean of centralized features = {np.mean(centralized_features, axis=0)}")  # Should be close to 0
    return centralized_features_by_category

centralized_test_features_by_category = centralize_features(test_features_by_category)

In [None]:
def calculate_reconstruction_error(test_features, pca_by_category):
    reconstruction_errors_by_category = {}
    mean_reconstruction_errors_by_category = {}
    
    for category, pca in pca_by_category.items():
        principal_components = pca.transform(test_features)
        reconstructed_features = pca.inverse_transform(principal_components)
        
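        reconstruction_error = np.linalg.norm(test_features - reconstructed_features, axis=1)
        # Normalize each sample's error by that sample's own norm (not by the norm of the whole matrix)
        reconstruction_errors_by_category[category] = reconstruction_error / np.linalg.norm(test_features, axis=1)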

    for category, errors in reconstruction_errors_by_category.items():
        mean_reconstruction_errors_by_category[category] = np.mean(errors)
    
    best_category = min(mean_reconstruction_errors_by_category, key=mean_reconstruction_errors_by_category.get)

    for category in mean_reconstruction_errors_by_category:
        print(f"Category {category}, mean reconstruction error: {mean_reconstruction_errors_by_category[category]}")
    
    print(f"Best category: {best_category}")
    print("=====================================")

    return mean_reconstruction_errors_by_category, best_category

for category, test_features in centralized_test_features_by_category.items():
    print(f"Test category: {category}")
    mean_reconstruction_errors, best_category = calculate_reconstruction_error(test_features, pca_by_category)


# Agnostic

In [None]:

from sklearn.decomposition import PCA
import numpy as np

# Dictionary that stores the PCA results
pca_results = {}

# Explained-variance percentages to compute
percentages = [95]
categories = ['Bedroom', 'Suburb', 'Industry', 'Kitchen', 'LivingRoom', 'Coast', 'Forest', 
              'Highway', 'InsideCity', 'Mountain', 'OpenCountry', 'Street', 'Building', 
              'Office', 'Store']

# Loop over the explained-variance percentages
for perc in percentages:
    # Initialize the dictionaries that store the per-category results
    pca_by_category = {}
    explained_variance_by_category = {}

    # Loop over the categories
    for category, features in features_by_category.items():
        # A fractional n_components keeps enough components to explain that share of the variance
        pca = PCA(n_components=perc / 100.0)
        principal_components = pca.fit_transform(features)

        # Store the fitted PCA and its explained variance for this category
        pca_by_category[category] = pca
        explained_variance_by_category[category] = pca.explained_variance_ratio_

        print(f"Category {category}, principal components: {principal_components.shape[1]}")

    # Store the results in the main pca_results dictionary
    pca_results[perc] = {}
    for category in categories:
        if category in pca_by_category:
            pca = pca_by_category[category]
            components = pca.components_
            explained_variance_ratio = pca.explained_variance_ratio_

            # Store the components, the explained variance, and the PCA object
            pca_results[perc][category] = {
                'components': components,
                'explained_variance_ratio': explained_variance_ratio,
                'pca_object': pca
            }

            print(f"Category {category}, principal components shape: {components.shape}")
            print(f"Category {category}, explained variance: {np.sum(explained_variance_ratio) * 100:.2f}%")
        else:
            print(f"Category '{category}' is not present in the data for {perc}%.")
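
Choosing the number of components from a variance threshold can be illustrated without scikit-learn. The sketch below is a NumPy-only approximation of the idea behind passing a fraction as `n_components`: keep the smallest number of components whose cumulative explained-variance ratio reaches the threshold. The function name `components_for_variance` and the synthetic data are assumptions for illustration.

```python
import numpy as np

def components_for_variance(X, threshold=0.95):
    """Smallest k whose cumulative explained-variance ratio reaches `threshold`."""
    centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance_ratio = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(variance_ratio)
    # First position where the cumulative ratio is >= threshold, as a 1-based count.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against rounding leaving the final cumulative value just below the threshold.
    return min(k, len(cumulative)), cumulative

rng = np.random.default_rng(0)
# Synthetic features whose per-dimension scale decays, so a few components dominate.
X = rng.normal(size=(500, 20)) * np.linspace(5.0, 0.1, 20)

k, cumulative = components_for_variance(X, 0.95)
print(f"{k} components explain {cumulative[k - 1] * 100:.2f}% of the variance")
```

Plotting `cumulative` against the component index gives the usual explained-variance curve from which the 95% cut-off is read.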


In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Project patches onto the principal components of a category
def project_test_patches(patches, pca_components):
    # Make sure patches has at least 2 dimensions
    if len(patches.shape) == 1:
        patches = np.expand_dims(patches, axis=0)  # Expand to (1, n_features)

    return np.dot(patches, pca_components.T)

# Compute the norms, means, and means of norms of the inner products within each component
def calculate_norms_and_means(projections_A, projections_B):
    norms = []
    means = []
    means_norms = []

    # Check that projections_A and projections_B have at least two dimensions
    if len(projections_A.shape) < 2 or len(projections_B.shape) < 2:
        raise ValueError("Projections must have at least two dimensions (patches, components).")

    for i in range(projections_A.shape[1]):  # Iterate over the components
        # Inner product of two 1-D columns, which is a scalar (transposing a 1-D array is a no-op)
        dot_products = np.dot(projections_A[:, i], projections_B[:, i])
        norms.append(np.linalg.norm(dot_products))
        means.append(np.mean(dot_products))
        means_norms.append(np.mean(np.linalg.norm(dot_products)))

    return norms, means, means_norms

# Plot the mean norms per component, aggregated over all images
def plot_mean_norms_for_all_images(category, other_category, mean_of_means_norms, color='blue'):
    plt.figure(figsize=(10, 6))

    # Bar chart of the per-component values averaged over all images
    plt.bar(range(len(mean_of_means_norms)), mean_of_means_norms, color=color,
            label=f'{category} on {other_category} - All Images')
    
    plt.title(f'Mean of Norms for Components ({category} on {other_category}) - 95% Variance Explained')
    plt.xlabel('Component Index')
    plt.ylabel('Mean of Norms')
    plt.legend()
    plt.show()

# Categories to iterate over
categories = ['Bedroom', 'Suburb', 'Industry', 'Kitchen', 'LivingRoom', 'Coast', 'Forest', 
              'Highway', 'InsideCity', 'Mountain', 'OpenCountry', 'Street', 'Building', 
              'Office', 'Store']

# Work only with the PCA that explains 95% of the variance
perc = 95

# Iterate over the categories to compute both intra-category and cross-category results
for category in categories:
    for other_category in categories:
        # Load the components of other_category (the same category in the intra-category case)
        components = pca_results[perc][other_category]['components']

        if category == other_category:
            print(f"\nCategory: {category} on {category} (Intra-Category), Percentage: {perc}%")
            color = 'blue'
        else:
            print(f"\nCategory: {category} on {other_category} (Cross-Category), Percentage: {perc}%")
            color = 'green'

        # Collect the mean norms of every image
        all_means_norms = []

        # Project the test features (intra- or cross-category) for every image
        # Iterate over each image in the feature array
        for image_idx in range(centralized_test_features_by_category[category].shape[0]):
            patches = centralized_test_features_by_category[category][image_idx]  # Select the patches of this image

            # Project the patches onto the components
            projected_patches = project_test_patches(patches, components)

            # Check the shape of projected_patches
            if len(projected_patches.shape) < 2 or projected_patches.shape[1] != components.shape[0]:
                raise ValueError(f"Projected patches have unexpected shape: {projected_patches.shape}. Expected at least 2 dimensions and components matching PCA.")

            # Compute the norms, means, and means of norms for each image
            _, _, means_norms_category = calculate_norms_and_means(projected_patches, projected_patches)

            # Store the values computed for this image
            all_means_norms.append(means_norms_category)

        # Average the norms over all images
        mean_of_means_norms = np.mean(all_means_norms, axis=0)  # Mean of the norms across all images
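
When `calculate_norms_and_means` receives the same projections for both arguments, as in the loop above, each per-component inner product is a non-negative scalar equal to the sum of squared projections, so the three returned lists hold the same values. The sketch below shows a vectorized way to compute that quantity. The name `component_energy` and the random projections are hypothetical and only serve as an illustration.

```python
import numpy as np

def component_energy(projections):
    """Sum of squared projections per component, shape (n_components,)."""
    return np.einsum("pi,pi->i", projections, projections)

rng = np.random.default_rng(0)
projections = rng.normal(size=(64, 5))  # hypothetical: 64 patches, 5 components

# The same quantity computed with an explicit per-component loop.
looped = np.array([np.dot(projections[:, i], projections[:, i])
                   for i in range(projections.shape[1])])

energy = component_energy(projections)
print(np.allclose(energy, looped))
```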
        


In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Project patches onto the principal components of a category
def project_test_patches(patches, pca_components):
    projected = np.dot(patches, pca_components.T)

    # If the result is one-dimensional, expand it to two dimensions
    if len(projected.shape) == 1:
        projected = np.expand_dims(projected, axis=0)

    return projected

# Compute the norms, means, and means of norms of the inner products within each component
def calculate_norms_and_means(projections_A, projections_B):
    norms = []
    means = []
    means_norms = []

    # Make sure projections_A has at least two dimensions
    if len(projections_A.shape) < 2:
        raise ValueError("projections_A must have at least two dimensions (patches, components).")

    for i in range(projections_A.shape[1]):  # Iterate over the components
        # Inner product of two 1-D columns, which is a scalar (transposing a 1-D array is a no-op)
        dot_products = np.dot(projections_A[:, i], projections_B[:, i])
        norms.append(np.linalg.norm(dot_products))
        means.append(np.mean(dot_products))
        means_norms.append(np.mean(np.linalg.norm(dot_products)))

    return norms, means, means_norms

# Select the components within the first ~90% of cumulative explained variance whose
# mean norms are below both the chosen percentile and an upper limit
def capture_components_by_percentile_and_threshold(explained_variance_ratio, means_norms, variance_threshold=0.9, exclude_first=True, norm_threshold=50):
    # Cumulative explained variance
    cumulative_variance = np.cumsum(explained_variance_ratio)

    # Indices of the components that together explain up to ~90% of the variance
    selected_indices = np.where(cumulative_variance <= variance_threshold)[0]

    # Drop the first component if requested
    if exclude_first and 0 in selected_indices:
        selected_indices = selected_indices[selected_indices != 0]

    # Make sure selected_indices is a list of ints
    selected_indices = list(map(int, selected_indices))

    # Nothing left to filter: a percentile of an empty selection is undefined
    if not selected_indices:
        return []

    # 7th percentile of means_norms over the selected components
    percentile = np.percentile([means_norms[i] for i in selected_indices], 7)

    # Keep the components whose means_norms is below both the percentile and the upper limit
    selected_indices_filtered = [i for i in selected_indices if means_norms[i] <= percentile and means_norms[i] <= norm_threshold]

    return selected_indices_filtered

# Categories to iterate over
categories = ['Bedroom', 'Suburb', 'Industry', 'Kitchen', 'LivingRoom', 'Coast', 'Forest', 
              'Highway', 'InsideCity', 'Mountain', 'OpenCountry', 'Street', 'Building', 
              'Office', 'Store']

# Work only with the PCA that explains 95% of the variance
perc = 95

selected_indices_dict = {}

# Iterate over the categories to compute both intra-category and cross-category results
for category in categories:
    for other_category in categories:
        # Load the components and explained variance of other_category (the same category in the intra-category case)
        components = pca_results[perc][other_category]['components']
        explained_variance_ratio = pca_results[perc][other_category]['explained_variance_ratio']

        if category == other_category:
            print(f"\nCategory: {category} on {category} (Intra-Category), Percentage: {perc}%")
        else:
            print(f"\nCategory: {category} on {other_category} (Cross-Category), Percentage: {perc}%")

        # Results of the projections (intra- or cross-category)
        all_means_norms = []
        all_selected_indices = []

        # Iterate over the images and their patches
        for image_idx in range(centralized_test_features_by_category[category].shape[0]):
            patches = centralized_test_features_by_category[category][image_idx]  # Select the patches of this image

            # Project the patches onto the components
            projected_patches = project_test_patches(patches, components)

            # Compute the norms, means, and means of norms for each image
            norms_category, means_category, means_norms_category = calculate_norms_and_means(projected_patches, projected_patches)

            # Select the components whose mean norms are below the percentile and that explain up to ~90% of the variance
            selected_indices = capture_components_by_percentile_and_threshold(explained_variance_ratio, means_norms_category, exclude_first=True, norm_threshold=50)

            # Store the mean norms and the selected components
            all_means_norms.append(means_norms_category)
            all_selected_indices.append(selected_indices)

        # Check whether any components were selected
        if len(all_selected_indices) == 0 or np.concatenate(all_selected_indices).size == 0:
            print(f"Warning: No components selected for {category} on {other_category}. Skipping this combination.")
            continue

        # Aggregate the components selected across all images
        aggregated_selected_indices = np.unique(np.concatenate(all_selected_indices)).astype(int)  # Convert to ints

        # Initialize the nested dictionary if the key does not exist yet
        if category not in selected_indices_dict:
            selected_indices_dict[category] = {}

        selected_indices_dict[category][other_category] = aggregated_selected_indices

        # Avoid plotting when no components were selected
        if len(aggregated_selected_indices) == 0:
            print(f"Warning: No valid components selected for {category} on {other_category}. Skipping plot.")
            continue


In [None]:
import os
import numpy as np

def project_and_transform_back(features, pca, specific_indices):
    """
    Project the features onto specific principal components and reconstruct them from those components only.
    """
    # Project the features onto the principal components (transform subtracts pca.mean_)
    projected = pca.transform(features)

    # Keep only the specific components
    projected_specific = projected[:, specific_indices]

    # Reconstruct from the specific components only, adding pca.mean_ back so the
    # result lies in the same space as `features`, as pca.inverse_transform does
    specific_components = pca.components_[specific_indices]
    reconstructed_features = np.dot(projected_specific, specific_components) + pca.mean_

    return reconstructed_features

def calculate_mean_ood_for_specific_components(original_features, pca, specific_indices):
    """
    Project the original features onto specific PCA components, reconstruct them, and return the mean OOD score.
    """
    total_ood_scores = []

    # Iterate over all feature samples
    for sample_idx in range(original_features.shape[0]):
        features = original_features[sample_idx]

        # Project onto the specific components and reconstruct
        reconstructed_features = project_and_transform_back(features.reshape(1, -1), pca, specific_indices)

        # Residuals (reconstruction error)
        residuals = features - reconstructed_features.flatten()

        # Norms of the original features and of the residuals
        original_norm = np.linalg.norm(features)
        residual_norm = np.linalg.norm(residuals)

        # Warn if the residual norm is greater than the norm of the original features
        if residual_norm > original_norm:
            print(f"Warning: Residual norm ({residual_norm}) greater than original norm ({original_norm}) for sample {sample_idx}")

        # OOD score: residual norm divided by the norm of the original features
        if original_norm == 0:
            ood_score = 0
        else:
            ood_score = residual_norm / original_norm

        # Add the OOD score of this sample to the list
        total_ood_scores.append(ood_score)

    # Return the mean of the OOD scores
    return np.mean(total_ood_scores)

# Iterate over the category pairs to compute the mean OOD scores
mean_ood_scores = {}

for category in categories:
    for other_category in categories:
        # Pairs for which no components were selected have no entry in selected_indices_dict
        specific_indices = selected_indices_dict.get(category, {}).get(other_category)
        if specific_indices is None:
            print(f"Warning: No selected components for {category} on {other_category}. Skipping.")
            continue

        # Retrieve the PCA object fitted on other_category
        pca_object = pca_results[perc][other_category]['pca_object']  # We use the components of other_category

        # Check that test features exist for the category
        if category not in centralized_test_features_by_category:
            print(f"Warning: No test features found for {category}. Skipping.")
            continue

        # Compute the mean OOD score from the projection onto the specific components
        mean_ood = calculate_mean_ood_for_specific_components(centralized_test_features_by_category[category], pca_object, specific_indices)

        # Store the mean in the dictionary
        mean_ood_scores[f"{category}_on_{other_category}"] = mean_ood

# Print all the computed means
for key, mean_ood in mean_ood_scores.items():
    print(f"Mean OOD Score for {key}: {mean_ood}")
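
For an orthonormal set of components, the residual of a projection can never be longer than the projected vector, provided both are measured after subtracting the same mean. The relative residual therefore lies in [0, 1] and works as a bounded OOD score. The following is a minimal NumPy sketch of that score restricted to a subset of components. The function `relative_residual`, the synthetic data, and the chosen indices are assumptions for illustration and do not reproduce the notebook's features.

```python
import numpy as np

def relative_residual(X, mean, components, indices):
    """Per-sample ||x_c - P x_c|| / ||x_c|| using only the components in `indices`."""
    centered = X - mean
    basis = components[indices]  # (k, n_features), orthonormal rows
    reconstructed = centered @ basis.T @ basis
    residual_norm = np.linalg.norm(centered - reconstructed, axis=1)
    original_norm = np.linalg.norm(centered, axis=1)
    # Samples with zero norm get a score of 0 instead of a division by zero.
    return np.divide(residual_norm, original_norm,
                     out=np.zeros_like(residual_norm), where=original_norm > 0)

rng = np.random.default_rng(0)
scales = np.array([4.0, 3.0, 2.0, 1.0, 0.1, 0.1, 0.1, 0.1])

# "Training" data concentrated along the first four directions.
train = rng.normal(size=(300, 8)) * scales
mean = train.mean(axis=0)
_, _, components = np.linalg.svd(train - mean, full_matrices=False)

in_dist = rng.normal(size=(100, 8)) * scales  # same distribution as training
out_dist = rng.normal(size=(100, 8))          # isotropic: energy outside the subspace

indices = [0, 1, 2, 3]
scores_in = relative_residual(in_dist, mean, components, indices)
scores_out = relative_residual(out_dist, mean, components, indices)
print(scores_in.mean(), scores_out.mean())
```

In this toy example the in-distribution samples score much lower than the isotropic ones. A score above 1 would indicate that the sample and its reconstruction were not centered with the same mean.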


In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Available categories
categories = ['Bedroom', 'Suburb', 'Industry', 'Kitchen', 'LivingRoom', 'Coast', 'Forest', 
              'Highway', 'InsideCity', 'Mountain', 'OpenCountry', 'Street', 'Building', 
              'Office', 'Store']

# Initialize a NaN-filled matrix to hold the OOD scores
ood_score_matrix = np.full((len(categories), len(categories)), np.nan)

# Fill the matrix with the computed OOD scores
for i, test_category in enumerate(categories):
    for j, train_category in enumerate(categories):
        key = f"{test_category}_on_{train_category}"
        if key in mean_ood_scores:
            ood_score_matrix[i, j] = mean_ood_scores[key]

# Build a DataFrame from the OOD score matrix to make plotting easier
ood_score_df = pd.DataFrame(ood_score_matrix, index=categories, columns=categories)

# Plot the heatmap using matplotlib only
fig, ax = plt.subplots(figsize=(12, 8))

# Draw the heatmap with imshow
cax = ax.imshow(ood_score_df, cmap="coolwarm", aspect="auto")

# Write the value of each cell on the heatmap
for i in range(len(categories)):
    for j in range(len(categories)):
        value = ood_score_matrix[i, j]
        if not np.isnan(value):
            ax.text(j, i, f'{value:.4f}', ha='center', va='center', color='black')

# Configure the axes
ax.set_xticks(np.arange(len(categories)))
ax.set_yticks(np.arange(len(categories)))
ax.set_xticklabels(categories, rotation=45, ha="right")
ax.set_yticklabels(categories)

# Add the title and the axis labels
ax.set_title('Heatmap of OOD Scores for Test and Train Categories (Specific PCA Components)')
ax.set_xlabel('Train Category')
ax.set_ylabel('Test Category')

# Add the colorbar
fig.colorbar(cax, ax=ax, label='OOD Score')

# Adjust the layout
plt.tight_layout()

# Show the heatmap
plt.show()
