<a href="https://colab.research.google.com/github/sergioGarcia91/ML_Carolina_Bays/blob/main/04_EDA_h5_AOI_01_03_toCSV.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook performs an Exploratory Data Analysis (EDA) on H5 files, focusing on two Areas of Interest (AOI) within Carolina Bays: **AOI 01** and **AOI 03**.  

- **AOI 01** contains *Fairy Circles* (FC).  
- **AOI 03** shows no evidence of FC.  

The main objective of this analysis is to explore and visualize the data, as well as to evaluate the differences between both AOIs. Additionally, a **normalized unit index** is proposed to relate the spectral bands and generate suitable features for training machine learning models.  

The areas **AOI 02** and **AOI 04** will be reserved as test sets to evaluate the performance of the trained models on the selected areas of interest.

# Start

In [None]:
!pip install tables

In [None]:
import numpy as np
import os
import time
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import h5py

from IPython.display import clear_output
from sklearn.decomposition import PCA

In [None]:
# Connect to Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Functions

## reshape_array_X

In [None]:
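def reshape_array_X(array_X):
  """Flatten (patches, bands, H, W) into one row per pixel: (patches*H*W, bands)."""
  shape_X = array_X.shape
  # Collapse the spatial dimensions: (patches, bands, H*W)
  reshaped_array = np.reshape(array_X, (shape_X[0], shape_X[1], -1))
  # Move the bands to the last axis: (patches, H*W, bands)
  reshaped_array = np.transpose(reshaped_array, (0, 2, 1))
  # One row per pixel, one column per band
  reshaped_array = reshaped_array.reshape(-1, shape_X[1])
  return reshaped_array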
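
As a quick check of the pixel ordering, the sketch below applies the same reshape to a small synthetic array. The function is repeated so the block runs on its own. The `(patches, bands, height, width)` layout is an assumption inferred from the indexing used later in this notebook (`X_1[0][6]`), and the toy sizes are illustrative, not values read from the H5 files.

```python
import numpy as np

def reshape_array_X(array_X):
    # Same steps as the function above
    shape_X = array_X.shape
    reshaped_array = np.reshape(array_X, (shape_X[0], shape_X[1], -1))
    reshaped_array = np.transpose(reshaped_array, (0, 2, 1))
    reshaped_array = reshaped_array.reshape(-1, 7)
    return reshaped_array

# Toy input: 2 patches, 7 bands, 4 x 4 pixels (sizes are illustrative only)
toy = np.arange(2 * 7 * 4 * 4).reshape(2, 7, 4, 4)
flat = reshape_array_X(toy)

print(flat.shape)
# The first 16 rows are the pixels of patch 0; column 6 holds band 7
print(np.array_equal(flat[:16, 6], toy[0, 6].ravel()))
```

Each row of the result is one pixel and each column one band, which is the layout the DataFrame built later expects.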

# Load data

In [None]:
path_save_h5 = '/content/drive/MyDrive/UIS/Doctorado_UIS2198589/1_semestre/TopicosAvanzadosGeofisica/FC_CarolinaBais/Dataset_h5'

h5_file = os.listdir(path_save_h5)
h5_file

In [None]:
# AOI 01
AOI_01 = h5py.File(os.path.join(path_save_h5, 'dataset_AOI_01.h5'), 'r')
# AOI 03
AOI_03 = h5py.File(os.path.join(path_save_h5, 'dataset_AOI_03.h5'), 'r')

print('AOI 01 [X]: ', AOI_01['AOI_01_X'].shape)
print('AOI 03 [X]: ', AOI_03['AOI_03_X'].shape)
print('\n')
print('AOI 01 [Y]: ', AOI_01['AOI_01_y'].shape)
print('AOI 03 [Y]: ', AOI_03['AOI_03_y'].shape)

## Get X and Y arrays

In [None]:
X_1 = AOI_01['AOI_01_X'][0::2]
y_1 = AOI_01['AOI_01_y'][0::2]

X_2 = AOI_03['AOI_03_X'][0::2]
y_2 = AOI_03['AOI_03_y'][0::2]

print('X_1: ', X_1.shape)
print('y_1: ', y_1.shape)
print('\n')
print('X_2: ', X_2.shape)
print('y_2: ', y_2.shape)

## Reshape

In [None]:
X_1_r = reshape_array_X(X_1)
y_1_r = y_1.reshape(-1,1)

X_2_r = reshape_array_X(X_2)
y_2_r = y_2.reshape(-1,1)

print('X_1_r: ', X_1_r.shape)
print('y_1_r: ', y_1_r.shape)
print('\n')
print('X_2_r: ', X_2_r.shape)
print('y_2_r: ', y_2_r.shape)

In [None]:
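# Sanity check (expected 0): band 7 of the first patch must match
# the first 1024 rows of the reshaped array
np.sum(X_1[0][6].flatten() - X_1_r[0:1024,6])

In [None]:
# Sanity check (expected 0): both sides are flattened to 1-D so the subtraction
# is element-wise instead of broadcasting (1024,) against (1024, 1)
np.sum(y_1[0].flatten() - y_1.reshape(-1)[0:1024])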

# DataFrame

In [None]:
X = np.concatenate((X_1_r, X_2_r), axis=0)
y = np.concatenate((y_1_r, y_2_r), axis=0)

print('X: ', X.shape)
print('y: ', y.shape)

In [None]:
df = pd.DataFrame(X, columns=['B1', 'B2', 'B3', 'B4', 'B5', 'B6', 'B7'])

df.head()

In [None]:
del X_1_r, y_1_r, X_2_r, y_2_r, X_1, X_2, y_1, y_2

## Normalized unit index

Normalized indices are generally defined as:  

\begin{equation}
\text{Index} = \frac{B_2 - B_1}{B_2 + B_1}
\end{equation}

Following this same methodology, the implementation will consider only relationships with **higher wavelength bands**. Specifically:  

- **Band 1** will be related to the six higher bands.  
- **Band 2** will be related to **Bands 3 to 7**, but **not to Band 1**, as this would result in an inverted sign of the previously computed relationship.  

To ensure that the resulting values fall within the **[0,1]** range, the following transformation will be applied:  

\begin{equation}
\text{Normalized Index} = \frac{\left( \frac{B_j - B_i}{B_j + B_i} \right) + 1}{2}
\end{equation}

where $B_i$ and $B_j$ represent the bands involved, ensuring that $j > i$.  
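
To make the transformation concrete, here is a minimal sketch on a three-band toy table. The values and the name `toy` are hypothetical and only illustrate the arithmetic. For non-negative band values the raw ratio lies in [-1, 1], so adding 1 and halving maps it to [0, 1].

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Toy band values (hypothetical, for illustration only)
toy = pd.DataFrame({'B1': [0.2, 0.0, 0.5],
                    'B2': [0.6, 0.0, 0.5],
                    'B3': [0.0, 0.4, 0.1]})

band_names = ['B1', 'B2', 'B3']
for b_i, b_j in combinations(band_names, 2):  # every pair with j > i
    raw = (toy[b_j] - toy[b_i]) / (toy[b_j] + toy[b_i])  # raw index in [-1, 1]
    toy[f'{b_j}_{b_i}'] = (raw + 1) / 2                  # rescaled to [0, 1]

print(toy.round(2))

# With 7 bands, the same pairing yields C(7, 2) index columns
n_pairs = len(list(combinations(range(7), 2)))
print(n_pairs)
```

Rows where both bands are 0 produce 0/0, which pandas stores as NaN; this is one reason a NaN replacement step is needed afterwards.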



In [None]:
bands = ['B1', 'B2', 'B3', 'B4', 'B5', 'B6', 'B7']

# Pair each band with every higher-numbered band (21 pairs in total)
for i in range(len(bands)-1):
  for j in range(i+1, len(bands)):
    b1 = bands[i]
    b2 = bands[j]

    col_name = b2 + '_' + b1
    df[col_name] = ( (df[b2] - df[b1]) / (df[b2] + df[b1]) + 1 ) / 2  # Normalize to [0,1]

df

In [None]:
df.head()

In [None]:
df['y'] = y

df.iloc[:100,:]

In [None]:
# Replace NaN with 0 (NaN appears, e.g., where both bands are 0 and the index is 0/0)
df = df.fillna(0)

df.iloc[:100,:]

In [None]:
df_describe = df.describe()

In [None]:
df_describe.iloc[0,:]

In [None]:
np.round(df_describe.iloc[1:,:], 2)

## Counts of 0 and 1

In [None]:
# Count the number of 0s and 1s in column 'y'
counts = df['y'].value_counts()

print(counts)
print('\n')
print('Counts of 0 and 1 in column "y":')
print('Total 0: ', counts[0])
print('Total 1: ', counts[1])
print('Total 0 - Total 1 = ', counts[0]-counts[1])
print('Total 0 / Total 1 = ', counts[0]/counts[1])


Category **0** has far more data points than category **1**.  

**Counts of 0 and 1 in column "y":**  
- **Total 0:** 91,701,160  
- **Total 1:** 11,626,584  
- **Difference (0 - 1):** 80,074,576  
- **Ratio (0 / 1):** 7.89  

---
This imbalance was expected, so the data will be saved as is.  

For training, the imbalance can be addressed by **downsampling** category **0**.
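
One possible way to downsample is sketched below on a small synthetic frame. The data, the 8:1 ratio, and the names `toy` and `balanced` are hypothetical; this is not the procedure used for training in this project, only an illustration of random undersampling with pandas.

```python
import numpy as np
import pandas as pd

# Toy frame with an 8:1 class imbalance (hypothetical data, not the AOI dataset)
rng = np.random.default_rng(0)
toy = pd.DataFrame({'B1': rng.random(900),
                    'y': np.repeat([0, 1], [800, 100])})

n_minority = int((toy['y'] == 1).sum())

# Keep every row of class 1 and an equally sized random sample of class 0
balanced = pd.concat([
    toy[toy['y'] == 0].sample(n=n_minority, random_state=42),
    toy[toy['y'] == 1],
]).sample(frac=1, random_state=42)  # shuffle the rows

print(balanced['y'].value_counts())
```

Fixing `random_state` keeps the sample reproducible. With more than 100 million rows, sampling by class as shown avoids copying the full majority class more than once.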

# To H5

Saving as CSV took too long, so the DataFrame is saved in H5 format instead.

In [None]:
path_saveCSV = '/content/drive/MyDrive/UIS/Doctorado_UIS2198589/1_semestre/TopicosAvanzadosGeofisica/FC_CarolinaBais/Dataset_CSV'

df.to_hdf(os.path.join(path_saveCSV, 'TRAIN_CarolinaBays_AOI_01_03.h5'),
          key='df',
          mode='w')

# Read H5

In [None]:
path_saveCSV = '/content/drive/MyDrive/UIS/Doctorado_UIS2198589/1_semestre/TopicosAvanzadosGeofisica/FC_CarolinaBais/Dataset_CSV'

df = pd.read_hdf(os.path.join(path_saveCSV, 'TRAIN_CarolinaBays_AOI_01_03.h5'), 'df')

df.head()

In [None]:
# Total number of rows: 103,327,744
df.info()

In [None]:
df

## Change Font of the Figures

In [None]:
!wget https://github.com/justrajdeep/fonts/raw/master/Times%20New%20Roman.ttf

In [None]:
import matplotlib.font_manager as fm

In [None]:
# Path to the custom font
font_path = 'Times New Roman.ttf'

# Add the font to the Matplotlib font manager
font_prop = fm.FontProperties(fname=font_path)
fm.fontManager.addfont(font_path)

# Get the font name to use in rcParams
font_name = font_prop.get_name()
font_name

In [None]:
plt.rcParams['font.family'] = font_name

# Plots

In [None]:
pathSavePlots = '/content/drive/MyDrive/UIS/Doctorado_UIS2198589/1_semestre/TopicosAvanzadosGeofisica/FC_CarolinaBais/Figures_EDA/'


## Boxplots

In [None]:
df.columns

In [None]:
df.columns[:-1]

In [None]:
columnasBandas = df.columns[:-1]

In [None]:
for band in columnasBandas:
  plt.figure(figsize=(5, 3))

  sns.boxplot(data=df,
              x=band,
              hue='y',
              fill=False,
              gap=.1)
  #plt.yscale('log')
  plt.title(f'Boxplot: {band}')
  plt.xlim(-0.1, 1.1)

  plt.grid(True, ls='--', color='gray', alpha=0.8)

  plt.savefig(pathSavePlots + f'Boxplot_{band}.png',
              dpi=500,
              bbox_inches = 'tight',
              pad_inches=0.25)

  plt.show()

  print('\n')

In [None]:
# Get the number of available cores
num_cores = os.cpu_count()

print(f"The number of available cores is: {num_cores}")

## Scatter plots

Only every 100th point of each class is plotted to keep rendering time manageable.

In [None]:
for B1 in range(0, len(columnasBandas)):
  for B2 in range(B1 + 1, len(columnasBandas)):
    if B1 != B2:
      plt.figure(figsize=(5, 5))
      for i in [0, 1]:
        plt.scatter(x=df[columnasBandas[B1]][df['y'] == i].iloc[::100],
                    y=df[columnasBandas[B2]][df['y'] == i].iloc[::100],
                    label=i,
                    s=5)
      #plt.yscale('log')
      plt.title(f'{columnasBandas[B1]} vs {columnasBandas[B2]}')

      plt.xlabel(f'{columnasBandas[B1]}')
      plt.ylabel(f'{columnasBandas[B2]}')
      plt.xlim(-0.1, 1.1)
      plt.ylim(-0.1, 1.1)

      plt.grid(True, ls='--', color='gray', alpha=0.8)
      plt.legend()

      plt.savefig(pathSavePlots + f'Scatterplot_{columnasBandas[B1]}_vs_{columnasBandas[B2]}.png',
                  dpi=500,
                  bbox_inches = 'tight',
                  pad_inches=0.25)

      plt.show()
      print('\n')

## Histograms

In [None]:
bins_hist = np.linspace(0, 1, 50)
bins_hist

In [None]:
for band in columnasBandas:
  plt.figure(figsize=(10, 5))
  sns.histplot(data=df,
               x=band,
               hue='y',
               stat='count',  # 'percent' is an alternative to 'count'
               fill=True,
               element="step",
               bins = bins_hist)
  plt.yscale('log')
  plt.xlim(-0.1, 1.1)

  plt.title(f'Histogram: {band}')

  plt.grid(True, ls='--', color='gray', alpha=0.8)

  plt.savefig(pathSavePlots + f'Histogram_{band}.png',
              dpi=500,
              bbox_inches = 'tight',
              pad_inches=0.25)

  plt.show()

  print('\n')

## Pearson's Linear Correlation

In [None]:
pearsonCorrelationMatrix = df[columnasBandas].corr(method='pearson')

# Calculate the mean absolute correlation for each variable
corr_mean = pearsonCorrelationMatrix.abs().mean().sort_values(ascending=False)

# Reorder the rows and columns of the correlation matrix based on the mean correlation
ordered_corr = pearsonCorrelationMatrix.loc[corr_mean.index, corr_mean.index]

In [None]:
plt.figure(figsize=(15, 10))
sns.heatmap(ordered_corr, # pearsonCorrelationMatrix
            annot=True,
            vmin=-1,
            vmax=1,
            cmap='coolwarm',
            fmt='.2f',
            linewidth=.5,
            cbar_kws={'shrink': 0.75})  # Colorbar size

plt.title("Pearson's Linear Correlation")

plt.savefig(pathSavePlots + 'Pearson.png',
            dpi=500,
            bbox_inches = 'tight',
            pad_inches=0.25)

plt.show()

# PCA

Since all features already lie in the 0 to 1 range, no further scaling is applied before the PCA.

In [None]:
columnasBandas

In [None]:
components = len(columnasBandas)  # Set the number of new components
# The original data has 28 features (7 bands + 21 indices), so all 28 components
# are kept at first to inspect their behavior and then decide which ones to keep
pca = PCA(n_components = components)
pca

## df PCA - Fit

In [None]:
X_scaled = df[columnasBandas].copy()

pca.fit(X_scaled)

In [None]:
X_scaled

In [None]:
X_pca = pca.transform(X_scaled)  # project the data onto the principal components

df_pca = pd.DataFrame(X_pca, columns= [f'PC_{i+1}' for i in range(X_pca.shape[1])])

df_pca['y'] = df['y']

df_pca.head()

In [None]:
df_pca.describe()

## Explained variance

In [None]:
# Variance explained by each principal component
explained_variance = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance)

print(np.round(cumulative_variance, 2))
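
The cumulative variance can be used to decide how many components to keep. The sketch below uses made-up variance ratios and an assumed 95% threshold; neither is a value chosen in this notebook.

```python
import numpy as np

# Hypothetical explained-variance ratios (they sum to 1), for illustration only
explained_variance = np.array([0.55, 0.25, 0.10, 0.06, 0.04])
cumulative_variance = np.cumsum(explained_variance)

threshold = 0.95  # assumed cut-off, not a value taken from this analysis

# Index of the first component at which the cumulative variance reaches
# the threshold, plus one to turn the index into a component count
n_keep = int(np.searchsorted(cumulative_variance, threshold) + 1)
print(n_keep)
```

`np.searchsorted` works here because the cumulative sum is non-decreasing. scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components` to select the component count by explained variance, which may be a simpler option.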

In [None]:
# Plot the explained variance
plt.figure(figsize=(10, 5))

plt.bar(range(1, len(explained_variance) + 1),
        explained_variance,
        alpha=0.5,
        align='center',
        label='Individual explained variance')

plt.step(range(1, len(explained_variance) + 1),
         cumulative_variance,
         where='mid',
         label='Cumulative explained variance')

# Set Y-axis ticks every 0.1
plt.yticks(np.arange(0, 1.1, 0.1))
plt.xticks(np.arange(0, 29, 1))

plt.ylabel('Explained variance ratio')
plt.xlabel('Principal Component')
plt.xlim(0, 29)

plt.legend(loc='best')
plt.grid(True, ls='--', color='gray', alpha=0.8)

plt.title('Variance Explained by Each Principal Component')

plt.savefig(pathSavePlots + 'PCA.png',
            dpi=500,
            bbox_inches='tight',
            pad_inches=0.25)

plt.show()

In [None]:
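# Singular values from the SVD of the centered data. They are related to the
# eigenvalues of the covariance matrix but are not the same thing:
# eigenvalue = singular_value**2 / (n_samples - 1), see pca.explained_variance_
np.round(pca.singular_values_, 3)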

In [None]:
np.round(pca.components_, 2)  # the "eigenvectors"

In [None]:
np.linalg.norm(pca.components_, axis=1)  # the norm of each eigenvector should be 1
# since all are unit vectors; the eigenvalue carries the magnitude
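
The relation between the singular values, the eigenvalues of the covariance matrix, and the unit-norm eigenvectors can be checked numerically. The sketch below uses random toy data (`X_toy` and `pca_toy` are hypothetical names standing in for the 28 feature columns and the fitted model).

```python
import numpy as np
from sklearn.decomposition import PCA

# Random toy data (hypothetical; stands in for the real feature table)
rng = np.random.default_rng(0)
X_toy = rng.random((500, 4))

pca_toy = PCA(n_components=4).fit(X_toy)

# Eigenvalues of the sample covariance matrix, sorted in descending order
eigvals = np.sort(np.linalg.eigvalsh(np.cov(X_toy, rowvar=False)))[::-1]

n_samples = X_toy.shape[0]
same_as_explained = np.allclose(pca_toy.explained_variance_, eigvals)
same_from_singular = np.allclose(pca_toy.singular_values_ ** 2 / (n_samples - 1), eigvals)
unit_norms = np.allclose(np.linalg.norm(pca_toy.components_, axis=1), 1.0)

print(same_as_explained, same_from_singular, unit_norms)
```

In short, `explained_variance_` holds the eigenvalues, `singular_values_` holds their scaled square roots, and each row of `components_` is a unit-length eigenvector.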

## Eigenvector Plots

In [None]:
df_pca_AutoVec = pd.DataFrame(pca.components_,
                              columns=columnasBandas,
                              index=[f'PC{i+1}' for i in range(components)])
df_pca_AutoVec

In [None]:
df_pca_AutoVec.abs().sum(axis=0).sort_values(ascending=False)

In [None]:
# List of the 28 principal components
PCA_components = [f'PC{i}' for i in range(1, 29)]  # From PC1 to PC28
bands = columnasBandas.copy()

# For each principal component, create a plot
for i in range(len(PCA_components)):
  plt.figure(figsize=(10, 4))

  # Plot the weights of each band in the current PCA component
  plt.bar(bands, pca.components_[i, :], color='b', alpha=0.7)

  # Configure the plot
  plt.title(f'Contribution of bands in {PCA_components[i]}')
  plt.ylabel('Eigenvector weight')
  plt.xlabel('Original bands')
  # Rotate X-axis labels by 90 degrees
  plt.xticks(rotation=90)

  plt.grid(True, ls='--', color='gray', alpha=0.6)

  plt.savefig(pathSavePlots + f'PCA_{PCA_components[i]}_axes.png',
              dpi=500,
              bbox_inches='tight',
              pad_inches=0.25)

  # Show the plot
  plt.show()
  print()


In [None]:
# List of the 28 principal components
PCA_components = [f'PC{i}' for i in range(1, 29)]  # From PC1 to PC28
bands = columnasBandas.copy()

# For each principal component, create a plot
for i in range(len(PCA_components)):
  plt.figure(figsize=(10, 4))

  # Plot the absolute weights of each band in the current PCA component
  plt.bar(bands, abs(pca.components_[i, :]), color='b', alpha=0.7)

  # Configure the plot
  plt.title(f'Contribution of bands in {PCA_components[i]}')
  plt.ylabel('Absolute eigenvector weight')
  plt.xlabel('Original bands')
  # Rotate X-axis labels by 90 degrees
  plt.xticks(rotation=90)

  plt.grid(True, ls='--', color='gray', alpha=0.6)

  plt.savefig(pathSavePlots + f'PCA_{PCA_components[i]}_axes_absolute.png',
              dpi=500,
              bbox_inches='tight',
              pad_inches=0.25)
  # Show the plot
  plt.show()
  print()


## Scatter plot PCA

In [None]:
df_pca.columns

In [None]:
columsnasPCA = df_pca.columns[:-1]
columsnasPCA

In [None]:
for band1 in range(0, len(columsnasPCA)):
  for band2 in range(band1 + 1, len(columsnasPCA)):
    if band1 != band2:
      plt.figure(figsize=(5, 5))

      for i in [0, 1]:
        plt.scatter(x=df_pca[columsnasPCA[band1]][df_pca['y'] == i].iloc[::100],
                    y=df_pca[columsnasPCA[band2]][df_pca['y'] == i].iloc[::100],
                    label=i,
                    s=5)

      plt.title(f'{columsnasPCA[band1]} vs {columsnasPCA[band2]}')
      plt.xlabel(f'{columsnasPCA[band1]}')
      plt.ylabel(f'{columsnasPCA[band2]}')

      plt.grid(True, ls='--', color='gray', alpha=0.8)
      plt.legend()

      plt.savefig(pathSavePlots + f'Scatterplot_PCA_{columsnasPCA[band1]}_{columsnasPCA[band2]}_v2.png',
                  dpi=500,
                  bbox_inches='tight',
                  pad_inches=0.25)

      plt.show()
      print('\n')


# End