<a href="https://colab.research.google.com/github/sergioGarcia91/ML_Carolina_Bays/blob/main/08_AOI_02_04_toCSV.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, an Exploratory Data Analysis (EDA) will be conducted on H5 format files, focusing on two Areas of Interest (AOI) within Carolina Bays: **AOI 01** and **AOI 03**.  

- **AOI 01** contains *Fairy Circles* (FC).  
- **AOI 03** shows no evidence of FC.  

The main objective of this analysis is to explore and visualize the data and to evaluate the differences between the two AOIs. Additionally, a **normalized unit index** is proposed to relate the spectral bands and generate suitable features for training machine learning models.  

The areas **AOI 02** and **AOI 04** are reserved as test sets to evaluate the performance of the trained models. This notebook loads those two areas, computes the same features, and saves them as the test table.

# Start

In [None]:
!pip install tables


In [None]:
import numpy as np
import os
import time
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import h5py


In [None]:
# Connect to Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Functions

## reshape_array_X

In [None]:
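def reshape_array_X(array_X):
  # array_X has shape (n_patches, n_bands, height, width)
  n_patches, n_bands = array_X.shape[0], array_X.shape[1]
  # Flatten the spatial dimensions: (n_patches, n_bands, height*width)
  reshaped_array = np.reshape(array_X, (n_patches, n_bands, -1))
  # Move bands last so each pixel keeps its band values together
  reshaped_array = np.transpose(reshaped_array, (0, 2, 1))
  # One row per pixel, one column per band
  reshaped_array = reshaped_array.reshape(-1, n_bands)
  return reshaped_array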
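
The pixel and band ordering produced by this reshape can be checked on a tiny synthetic array. The sketch below is only illustrative: the function name `reshape_array_X_demo` and the toy sizes (2 patches, 7 bands, 2x2 pixels) are assumptions, not part of the original notebook.

```python
import numpy as np

def reshape_array_X_demo(array_X):
    """Hypothetical re-statement of reshape_array_X for a self-contained demo."""
    n_patches, n_bands = array_X.shape[0], array_X.shape[1]
    flat = np.reshape(array_X, (n_patches, n_bands, -1))   # (N, bands, H*W)
    flat = np.transpose(flat, (0, 2, 1))                    # (N, H*W, bands)
    return flat.reshape(-1, n_bands)                        # (N*H*W, bands)

# Toy input: 2 patches, 7 bands, 2x2 pixels (sizes chosen only for illustration)
toy = np.arange(2 * 7 * 2 * 2, dtype=np.float32).reshape(2, 7, 2, 2)
rows = reshape_array_X_demo(toy)

print(rows.shape)  # (8, 7)
# Row 0 holds the 7 band values of pixel (0, 0) in patch 0
print(np.array_equal(rows[0], toy[0, :, 0, 0]))  # True
# Rows 0..3 of column 6 are band 7 of patch 0, in row-major pixel order
print(np.array_equal(rows[0:4, 6], toy[0, 6].flatten()))  # True
```

Each output row corresponds to one pixel, which is the layout the DataFrame columns `B1` to `B7` rely on.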

# Load data

In [None]:
path_save_h5 = '/content/drive/MyDrive/UIS/Doctorado_UIS2198589/1_semestre/TopicosAvanzadosGeofisica/FC_CarolinaBais/Dataset_h5'

h5_file = os.listdir(path_save_h5)
h5_file

['dataset_AOI_01_32x32.h5',
 'dataset_AOI_02_32x32.h5',
 'dataset_AOI_03_32x32.h5',
 'dataset_AOI_04_32x32.h5']

In [None]:
# AOI 02
AOI_02 = h5py.File(os.path.join(path_save_h5, 'dataset_AOI_02_32x32.h5'), 'r')
# AOI 04
AOI_04 = h5py.File(os.path.join(path_save_h5, 'dataset_AOI_04_32x32.h5'), 'r')

print('AOI 02 [X]: ', AOI_02['AOI_02_X'].shape)
print('AOI 04 [X]: ', AOI_04['AOI_04_X'].shape)
print('\n')
print('AOI 02 [Y]: ', AOI_02['AOI_02_y'].shape)
print('AOI 04 [Y]: ', AOI_04['AOI_04_y'].shape)

AOI 02 [X]:  (100905, 7, 32, 32)
AOI 04 [X]:  (100905, 7, 32, 32)


AOI 02 [Y]:  (100905, 32, 32)
AOI 04 [Y]:  (100905, 32, 32)


## Get X and Y arrays

In [None]:
X_1 = AOI_02['AOI_02_X'][0::2]
y_1 = AOI_02['AOI_02_y'][0::2]

X_2 = AOI_04['AOI_04_X'][0::2]
y_2 = AOI_04['AOI_04_y'][0::2]

print('X_1: ', X_1.shape)
print('y_1: ', y_1.shape)
print('\n')
print('X_2: ', X_2.shape)
print('y_2: ', y_2.shape)

X_1:  (50453, 7, 32, 32)
y_1:  (50453, 32, 32)


X_2:  (50453, 7, 32, 32)
y_2:  (50453, 32, 32)


## Reshape

In [None]:
X_1_r = reshape_array_X(X_1)
y_1_r = y_1.reshape(-1,1)

X_2_r = reshape_array_X(X_2)
y_2_r = y_2.reshape(-1,1)

print('X_1_r: ', X_1_r.shape)
print('y_1_r: ', y_1_r.shape)
print('\n')
print('X_2_r: ', X_2_r.shape)
print('y_2_r: ', y_2_r.shape)

X_1_r:  (51663872, 7)
y_1_r:  (51663872, 1)


X_2_r:  (51663872, 7)
y_2_r:  (51663872, 1)


In [None]:
sum(X_1[0][6].flatten() - X_1_r[0:1024,6])

0.0

In [None]:
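# Compare arrays of the same shape: subtracting a (1024, 1) slice from a (1024,) vector would broadcast to (1024, 1024)
np.sum(np.abs(y_1[0].flatten() - y_1_r[0:1024, 0]))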

0.0
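
When a flattened patch is compared with a slice of a column vector, both operands should have the same shape. Otherwise NumPy broadcasting silently produces a matrix of pairwise differences, and a signed sum over it can be zero even when individual elements differ. A minimal sketch on toy data (the array names and sizes are assumptions):

```python
import numpy as np

# Toy label patches: 3 patches of 2x2 pixels (sizes are illustrative assumptions)
y_demo = np.arange(12, dtype=np.float32).reshape(3, 2, 2)
y_demo_r = y_demo.reshape(-1, 1)           # column vector, shape (12, 1)

first_patch = y_demo[0].flatten()          # shape (4,)

# Pitfall: (4,) minus (4, 1) broadcasts to a (4, 4) matrix of pairwise differences
pairwise = first_patch - y_demo_r[0:4]
print(pairwise.shape)  # (4, 4)

# Shape-aware check: compare like with like, element by element
same = np.array_equal(first_patch, y_demo_r[0:4, 0])
print(same)  # True
```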

# DataFrame

In [None]:
X = np.concatenate((X_1_r, X_2_r), axis=0)
y = np.concatenate((y_1_r, y_2_r), axis=0)

print('X: ', X.shape)
print('y: ', y.shape)

X:  (103327744, 7)
y:  (103327744, 1)


In [None]:
df = pd.DataFrame(X, columns=['B1', 'B2', 'B3', 'B4', 'B5', 'B6', 'B7'])

df.head()

Unnamed: 0,B1,B2,B3,B4,B5,B6,B7
0,0.02825,0.04222,0.070765,0.066145,0.344665,0.23494,0.129945
1,0.021843,0.03144,0.063588,0.056878,0.33482,0.197293,0.106295
2,0.01978,0.031248,0.066612,0.062295,0.341035,0.230292,0.124775
3,0.029048,0.038782,0.08567,0.080995,0.321235,0.227185,0.13649
4,0.013977,0.022503,0.055668,0.043237,0.360258,0.17719,0.087623


In [None]:
del X_1_r, y_1_r, X_2_r, y_2_r, X_1, X_2, y_1, y_2

## Normalized unit index

Normalized indices are generally defined as:  

\begin{equation}
\text{Index} = \frac{B_2 - B_1}{B_2 + B_1}
\end{equation}

Following this same methodology, each band is paired only with bands of **longer wavelength** (higher band number). Specifically:  

- **Band 1** will be related to the six higher bands.  
- **Band 2** will be related to **Bands 3 to 7**, but **not to Band 1**, as this would result in an inverted sign of the previously computed relationship.  

To ensure that the resulting values fall within the **[0,1]** range, the following transformation will be applied:  

\begin{equation}
\text{Normalized Index} = \frac{\left( \frac{B_j - B_i}{B_j + B_i} \right) + 1}{2}
\end{equation}

where $B_i$ and $B_j$ represent the bands involved, ensuring that $j > i$.  
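
As a small worked example, the sketch below evaluates the normalized index for the `B1` and `B2` values in the first row of `df.head()`, shows that the expression reduces algebraically to $B_j / (B_j + B_i)$, and counts the band pairs. The function name `normalized_unit_index` is an assumption introduced for illustration.

```python
import numpy as np
from itertools import combinations

def normalized_unit_index(b_i, b_j):
    """Rescale (b_j - b_i) / (b_j + b_i) from [-1, 1] to [0, 1]."""
    return ((b_j - b_i) / (b_j + b_i) + 1.0) / 2.0

# B1 and B2 values from the first row of df.head()
b1, b2 = 0.02825, 0.04222
idx = normalized_unit_index(b1, b2)
print(round(idx, 4))  # 0.5991

# Algebraically the expression reduces to b_j / (b_j + b_i)
print(np.isclose(idx, b2 / (b2 + b1)))  # True

# Pairing each band only with higher-numbered bands gives C(7, 2) = 21 indices
bands = ['B1', 'B2', 'B3', 'B4', 'B5', 'B6', 'B7']
pairs = [f'{bj}_{bi}' for bi, bj in combinations(bands, 2)]
print(len(pairs), pairs[0], pairs[-1])  # 21 B2_B1 B7_B6
```

With the 7 original bands, the 21 indices give 28 feature columns per pixel.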



In [None]:
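bands = ['B1', 'B2', 'B3', 'B4', 'B5', 'B6', 'B7']

# Pair each band with every higher-numbered band (21 pairs in total)
for i in range(len(bands)-1):
  for j in range(i+1, len(bands)):
    b1 = bands[i]
    b2 = bands[j]

    col_name = b2 + '_' + b1
    # Rescale the normalized difference from [-1,1] to [0,1]; a zero denominator yields NaN
    df[col_name] = ( (df[b2] - df[b1]) / (df[b2] + df[b1]) + 1 ) / 2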

df

Unnamed: 0,B1,B2,B3,B4,B5,B6,B7,B2_B1,B3_B1,B4_B1,...,B4_B3,B5_B3,B6_B3,B7_B3,B5_B4,B6_B4,B7_B4,B6_B5,B7_B5,B7_B6
0,0.028250,0.042220,0.070765,0.066145,0.344665,0.234940,0.129945,0.599120,0.714690,0.700726,...,0.483128,0.829658,0.768519,0.647427,0.838989,0.780311,0.662680,0.405345,0.273793,0.356126
1,0.021843,0.031440,0.063588,0.056878,0.334820,0.197293,0.106295,0.590062,0.744323,0.722529,...,0.472150,0.840396,0.756258,0.625697,0.854792,0.776223,0.651427,0.370772,0.240969,0.350130
2,0.019780,0.031248,0.066612,0.062295,0.341035,0.230292,0.124775,0.612366,0.771045,0.759001,...,0.483254,0.836593,0.775644,0.651950,0.845548,0.787089,0.666996,0.403083,0.267867,0.351412
3,0.029048,0.038782,0.085670,0.080995,0.321235,0.227185,0.136490,0.571760,0.746791,0.736034,...,0.485975,0.789459,0.726167,0.614377,0.798635,0.737183,0.627584,0.414254,0.298192,0.375308
4,0.013977,0.022503,0.055668,0.043237,0.360258,0.177190,0.087623,0.616845,0.799304,0.755702,...,0.437162,0.866160,0.760937,0.611505,0.892843,0.803847,0.669590,0.329688,0.195638,0.330885
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
103327739,0.211372,0.220282,0.222318,0.213215,0.418337,0.254052,0.203452,0.510321,0.512618,0.502170,...,0.489550,0.652984,0.533309,0.477846,0.662395,0.543698,0.488285,0.377835,0.327205,0.444700
103327740,0.176035,0.167675,0.273358,0.251440,0.424002,0.314910,0.248718,0.487839,0.608282,0.588198,...,0.479118,0.608011,0.535318,0.476402,0.627740,0.556034,0.497278,0.426180,0.369719,0.441280
103327741,0.035372,0.052313,0.137095,0.118918,0.319943,0.229605,0.168197,0.596596,0.794903,0.770740,...,0.464499,0.700036,0.626139,0.550939,0.729031,0.658795,0.585819,0.417807,0.344568,0.422817
103327742,0.014225,0.023217,0.088970,0.066475,0.295247,0.176750,0.114957,0.620084,0.862154,0.823730,...,0.427643,0.768438,0.665174,0.563717,0.816226,0.726693,0.633610,0.374472,0.280244,0.394085


In [None]:
df.head()

Unnamed: 0,B1,B2,B3,B4,B5,B6,B7,B2_B1,B3_B1,B4_B1,...,B4_B3,B5_B3,B6_B3,B7_B3,B5_B4,B6_B4,B7_B4,B6_B5,B7_B5,B7_B6
0,0.02825,0.04222,0.070765,0.066145,0.344665,0.23494,0.129945,0.59912,0.71469,0.700726,...,0.483128,0.829658,0.768519,0.647427,0.838989,0.780311,0.66268,0.405345,0.273793,0.356126
1,0.021843,0.03144,0.063588,0.056878,0.33482,0.197293,0.106295,0.590062,0.744323,0.722529,...,0.47215,0.840396,0.756258,0.625697,0.854792,0.776223,0.651427,0.370772,0.240969,0.35013
2,0.01978,0.031248,0.066612,0.062295,0.341035,0.230292,0.124775,0.612366,0.771045,0.759001,...,0.483254,0.836593,0.775644,0.65195,0.845548,0.787089,0.666996,0.403083,0.267867,0.351412
3,0.029048,0.038782,0.08567,0.080995,0.321235,0.227185,0.13649,0.57176,0.746791,0.736034,...,0.485975,0.789459,0.726167,0.614377,0.798635,0.737183,0.627584,0.414254,0.298192,0.375308
4,0.013977,0.022503,0.055668,0.043237,0.360258,0.17719,0.087623,0.616845,0.799304,0.755702,...,0.437162,0.86616,0.760937,0.611505,0.892843,0.803847,0.66959,0.329688,0.195638,0.330885


In [None]:
df['y'] = y

df.iloc[:100,:]

Unnamed: 0,B1,B2,B3,B4,B5,B6,B7,B2_B1,B3_B1,B4_B1,...,B5_B3,B6_B3,B7_B3,B5_B4,B6_B4,B7_B4,B6_B5,B7_B5,B7_B6,y
0,0.028250,0.042220,0.070765,0.066145,0.344665,0.234940,0.129945,0.599120,0.714690,0.700726,...,0.829658,0.768519,0.647427,0.838989,0.780311,0.662680,0.405345,0.273793,0.356126,0.0
1,0.021843,0.031440,0.063588,0.056878,0.334820,0.197293,0.106295,0.590062,0.744323,0.722529,...,0.840396,0.756258,0.625697,0.854792,0.776223,0.651427,0.370772,0.240969,0.350130,0.0
2,0.019780,0.031248,0.066612,0.062295,0.341035,0.230292,0.124775,0.612366,0.771045,0.759001,...,0.836593,0.775644,0.651950,0.845548,0.787089,0.666996,0.403083,0.267867,0.351412,0.0
3,0.029048,0.038782,0.085670,0.080995,0.321235,0.227185,0.136490,0.571760,0.746791,0.736034,...,0.789459,0.726167,0.614377,0.798635,0.737183,0.627584,0.414254,0.298192,0.375308,0.0
4,0.013977,0.022503,0.055668,0.043237,0.360258,0.177190,0.087623,0.616845,0.799304,0.755702,...,0.866160,0.760937,0.611505,0.892843,0.803847,0.669590,0.329688,0.195638,0.330885,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0.023932,0.030643,0.073350,0.046455,0.454170,0.212362,0.092215,0.561475,0.753990,0.659989,...,0.860953,0.743273,0.556972,0.907206,0.820511,0.664996,0.318608,0.168773,0.302764,1.0
96,0.036968,0.052175,0.103325,0.099420,0.377280,0.312215,0.172927,0.585299,0.736497,0.728952,...,0.785011,0.751348,0.625976,0.791441,0.758475,0.634952,0.452817,0.314295,0.356447,0.0
97,0.033915,0.052478,0.102940,0.099942,0.366033,0.306412,0.173973,0.607431,0.752183,0.746634,...,0.780499,0.748530,0.628258,0.785520,0.754051,0.635133,0.455669,0.322168,0.362152,0.0
98,0.032788,0.049205,0.097742,0.090153,0.355885,0.269233,0.147573,0.600116,0.748813,0.733305,...,0.784531,0.733654,0.601563,0.797881,0.749148,0.620770,0.430691,0.293118,0.354056,0.0


In [None]:
# Replace NaN (e.g. from 0/0 when both bands of a pair are zero) with 0
df = df.fillna(0)

df.iloc[:100,:]

Unnamed: 0,B1,B2,B3,B4,B5,B6,B7,B2_B1,B3_B1,B4_B1,...,B5_B3,B6_B3,B7_B3,B5_B4,B6_B4,B7_B4,B6_B5,B7_B5,B7_B6,y
0,0.028250,0.042220,0.070765,0.066145,0.344665,0.234940,0.129945,0.599120,0.714690,0.700726,...,0.829658,0.768519,0.647427,0.838989,0.780311,0.662680,0.405345,0.273793,0.356126,0.0
1,0.021843,0.031440,0.063588,0.056878,0.334820,0.197293,0.106295,0.590062,0.744323,0.722529,...,0.840396,0.756258,0.625697,0.854792,0.776223,0.651427,0.370772,0.240969,0.350130,0.0
2,0.019780,0.031248,0.066612,0.062295,0.341035,0.230292,0.124775,0.612366,0.771045,0.759001,...,0.836593,0.775644,0.651950,0.845548,0.787089,0.666996,0.403083,0.267867,0.351412,0.0
3,0.029048,0.038782,0.085670,0.080995,0.321235,0.227185,0.136490,0.571760,0.746791,0.736034,...,0.789459,0.726167,0.614377,0.798635,0.737183,0.627584,0.414254,0.298192,0.375308,0.0
4,0.013977,0.022503,0.055668,0.043237,0.360258,0.177190,0.087623,0.616845,0.799304,0.755702,...,0.866160,0.760937,0.611505,0.892843,0.803847,0.669590,0.329688,0.195638,0.330885,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0.023932,0.030643,0.073350,0.046455,0.454170,0.212362,0.092215,0.561475,0.753990,0.659989,...,0.860953,0.743273,0.556972,0.907206,0.820511,0.664996,0.318608,0.168773,0.302764,1.0
96,0.036968,0.052175,0.103325,0.099420,0.377280,0.312215,0.172927,0.585299,0.736497,0.728952,...,0.785011,0.751348,0.625976,0.791441,0.758475,0.634952,0.452817,0.314295,0.356447,0.0
97,0.033915,0.052478,0.102940,0.099942,0.366033,0.306412,0.173973,0.607431,0.752183,0.746634,...,0.780499,0.748530,0.628258,0.785520,0.754051,0.635133,0.455669,0.322168,0.362152,0.0
98,0.032788,0.049205,0.097742,0.090153,0.355885,0.269233,0.147573,0.600116,0.748813,0.733305,...,0.784531,0.733654,0.601563,0.797881,0.749148,0.620770,0.430691,0.293118,0.354056,0.0
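
NaN values in the index columns most likely come from pixels where both bands of a pair are zero, so the ratio is 0/0. The sketch below, on toy data with assumed values, counts such rows before replacing them.

```python
import numpy as np
import pandas as pd

# Toy frame: the second row has both bands equal to zero (values are illustrative)
demo = pd.DataFrame({'B1': [0.02, 0.0, 0.10],
                     'B2': [0.04, 0.0, 0.30]}, dtype='float32')

demo['B2_B1'] = ((demo['B2'] - demo['B1']) / (demo['B2'] + demo['B1']) + 1) / 2

# Count the undefined indices before filling them
n_nan = int(demo['B2_B1'].isna().sum())
print(n_nan)  # 1  (the 0/0 row)

demo = demo.fillna(0)
print(demo['B2_B1'].tolist())
```

Note that 0 is also a legitimate index value (it occurs when $B_j = 0$ and $B_i > 0$), so after filling, undefined pixels can no longer be told apart from those cases.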


In [None]:
df_describe = df.describe()

In [None]:
df_describe.iloc[0,:]

Unnamed: 0,count
B1,103327744.0
B2,103327744.0
B3,103327744.0
B4,103327744.0
B5,103327744.0
B6,103327744.0
B7,103327744.0
B2_B1,103327744.0
B3_B1,103327744.0
B4_B1,103327744.0


In [None]:
np.round(df_describe.iloc[1:,:], 2)

Unnamed: 0,B1,B2,B3,B4,B5,B6,B7,B2_B1,B3_B1,B4_B1,...,B5_B3,B6_B3,B7_B3,B5_B4,B6_B4,B7_B4,B6_B5,B7_B5,B7_B6,y
mean,0.07,0.08,0.11,0.1,0.38,0.22,0.13,0.56,0.72,0.67,...,0.81,0.72,0.57,0.84,0.77,0.64,0.35,0.22,0.33,0.09
std,0.15,0.15,0.14,0.15,0.14,0.13,0.11,0.14,0.13,0.14,...,0.11,0.1,0.08,0.12,0.12,0.1,0.08,0.1,0.05,0.28
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.01,0.02,0.04,0.03,0.32,0.14,0.06,0.52,0.65,0.59,...,0.78,0.7,0.54,0.78,0.74,0.62,0.29,0.14,0.29,0.0
50%,0.02,0.03,0.06,0.04,0.38,0.19,0.08,0.57,0.74,0.68,...,0.86,0.75,0.57,0.9,0.81,0.66,0.33,0.19,0.32,0.0
75%,0.04,0.05,0.1,0.1,0.45,0.29,0.16,0.59,0.79,0.72,...,0.89,0.78,0.61,0.93,0.84,0.69,0.41,0.3,0.37,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.73,1.0


## Counts of 0 and 1

In [None]:
# Count the number of 0s and 1s in column 'y'
counts = df['y'].value_counts()

print(counts)
print('\n')
print('Counts of 0 and 1 in column "Y":')
print('Total 0: ', counts[0])
print('Total 1: ', counts[1])
print('Total 0 - Total 1 = ', counts[0]-counts[1])
print('Total 0 / Total 1 = ', counts[0]/counts[1])


y
0.0    94443978
1.0     8883766
Name: count, dtype: int64


Counts of 0 and 1 in column "Y":
Total 0:  94443978
Total 1:  8883766
Total 0 - Total 1 =  85560212
Total 0 / Total 1 =  10.631074478999109
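
These counts can be cross-checked against the table size and the mean of `y` in the summary statistics. The variable names below are assumptions; the numbers are the ones printed above.

```python
# Counts reported above for the test table
n_zero = 94_443_978
n_one = 8_883_766
total = n_zero + n_one

print(total)                     # 103327744, the number of rows in df
print(round(n_one / total, 2))   # 0.09, matches the mean of y in df.describe()
print(round(n_zero / n_one, 2))  # 10.63
```

Roughly 1 pixel in 12 is labeled 1, so the test set is strongly imbalanced.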


# To H5

Writing the table to CSV took too long, so it is saved in HDF5 (H5) format instead. The variable `path_saveCSV` keeps its original name even though the output is an H5 file.

In [None]:
path_saveCSV = '/content/drive/MyDrive/UIS/Doctorado_UIS2198589/1_semestre/TopicosAvanzadosGeofisica/FC_CarolinaBais/Dataset_CSV'

df.to_hdf(os.path.join(path_saveCSV, 'TEST_CarolinaBays_AOI_02_04.h5'),
          key='df',
          mode='w')

In [None]:
os.listdir(path_saveCSV)

['TRAIN_CarolinaBays_AOI_01_03.h5', 'TEST_CarolinaBays_AOI_02_04.h5']

In [None]:
df2 = pd.read_hdf(os.path.join(path_saveCSV, 'TEST_CarolinaBays_AOI_02_04.h5'), 'df')

df2.head()

Unnamed: 0,B1,B2,B3,B4,B5,B6,B7,B2_B1,B3_B1,B4_B1,...,B5_B3,B6_B3,B7_B3,B5_B4,B6_B4,B7_B4,B6_B5,B7_B5,B7_B6,y
0,0.02825,0.04222,0.070765,0.066145,0.344665,0.23494,0.129945,0.59912,0.71469,0.700726,...,0.829658,0.768519,0.647427,0.838989,0.780311,0.66268,0.405345,0.273793,0.356126,0.0
1,0.021843,0.03144,0.063588,0.056878,0.33482,0.197293,0.106295,0.590062,0.744323,0.722529,...,0.840396,0.756258,0.625697,0.854792,0.776223,0.651427,0.370772,0.240969,0.35013,0.0
2,0.01978,0.031248,0.066612,0.062295,0.341035,0.230292,0.124775,0.612366,0.771045,0.759001,...,0.836593,0.775644,0.65195,0.845548,0.787089,0.666996,0.403083,0.267867,0.351412,0.0
3,0.029048,0.038782,0.08567,0.080995,0.321235,0.227185,0.13649,0.57176,0.746791,0.736034,...,0.789459,0.726167,0.614377,0.798635,0.737183,0.627584,0.414254,0.298192,0.375308,0.0
4,0.013977,0.022503,0.055668,0.043237,0.360258,0.17719,0.087623,0.616845,0.799304,0.755702,...,0.86616,0.760937,0.611505,0.892843,0.803847,0.66959,0.329688,0.195638,0.330885,0.0


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103327744 entries, 0 to 103327743
Data columns (total 29 columns):
 #   Column  Dtype  
---  ------  -----  
 0   B1      float32
 1   B2      float32
 2   B3      float32
 3   B4      float32
 4   B5      float32
 5   B6      float32
 6   B7      float32
 7   B2_B1   float32
 8   B3_B1   float32
 9   B4_B1   float32
 10  B5_B1   float32
 11  B6_B1   float32
 12  B7_B1   float32
 13  B3_B2   float32
 14  B4_B2   float32
 15  B5_B2   float32
 16  B6_B2   float32
 17  B7_B2   float32
 18  B4_B3   float32
 19  B5_B3   float32
 20  B6_B3   float32
 21  B7_B3   float32
 22  B5_B4   float32
 23  B6_B4   float32
 24  B7_B4   float32
 25  B6_B5   float32
 26  B7_B5   float32
 27  B7_B6   float32
 28  y       float32
dtypes: float32(29)
memory usage: 11.2 GB
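
The reported row count and memory usage follow directly from the shapes printed earlier. The sketch below reproduces the arithmetic, assuming pandas reports memory in units of 1024^3 bytes and ignoring the negligible size of the index.

```python
# Every second patch of 100,905 is kept by the [0::2] slice
n_patches_per_aoi = len(range(0, 100_905, 2))   # 50453
pixels_per_patch = 32 * 32
n_aoi = 2
n_columns = 29                                  # 7 bands + 21 indices + y
bytes_per_value = 4                             # float32

n_rows = n_patches_per_aoi * pixels_per_patch * n_aoi
n_bytes = n_rows * n_columns * bytes_per_value

print(n_rows)                        # 103327744
print(round(n_bytes / 1024**3, 1))   # 11.2
```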


# End