<a href="https://colab.research.google.com/github/sergioGarcia91/ML_Carolina_Bays/blob/main/07a_MLPClassifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, **30 MLPClassifier models** will be trained, following an approach similar to the one used for logistic regression. Since category 0 contains more pixels than category 1, category 0 will be **downsampled** to balance the dataset.  

The process involves iteratively separating the data from both categories. In each iteration, the number of samples in category 1 will be counted, and an equal number of samples from category 0 will be randomly selected.  

To introduce more randomness during training, a new **train-test split** will be performed in each iteration, ensuring that the training data for category 1 varies in every cycle.  
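The downsampling and per-iteration split described above can be sketched on a toy DataFrame. This is a minimal illustration, not the notebook's training loop: `balanced_split`, its parameters, and the toy data are hypothetical names introduced here, and the sketch also balances the test set, which is a choice of the example.

```python
import pandas as pd

def balanced_split(df, label="y", train_frac=0.8, seed=None):
    """Downsample category 0 to the size of category 1, then split 80/20."""
    # Shuffle category 1 and draw an equally sized random sample of category 0
    ones = df[df[label] == 1].sample(frac=1, random_state=seed)
    zeros = df[df[label] == 0].sample(n=len(ones), random_state=seed)

    n_train = int(train_frac * len(ones))
    train = pd.concat([ones.iloc[:n_train], zeros.iloc[:n_train]])
    test = pd.concat([ones.iloc[n_train:], zeros.iloc[n_train:]])

    # Shuffle again so the two categories are interleaved
    train = train.sample(frac=1, random_state=seed).reset_index(drop=True)
    test = test.sample(frac=1, random_state=seed).reset_index(drop=True)
    return train, test

# Toy data: 10 rows of category 1 and 90 rows of category 0
toy = pd.DataFrame({"B1": range(100), "y": [1] * 10 + [0] * 90})
train, test = balanced_split(toy, seed=0)
print(train["y"].value_counts().to_dict())
print(test["y"].value_counts().to_dict())
```

Calling the helper with a different `seed` (or `seed=None`) in each iteration yields a different selection of rows every time, which is the source of the extra randomness mentioned above.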



# Start

In [None]:
!pip install tables

Collecting tables
  Downloading tables-3.10.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.0 kB)
Collecting numexpr>=2.6.2 (from tables)
  Downloading numexpr-2.10.2-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (8.1 kB)
Collecting py-cpuinfo (from tables)
  Downloading py_cpuinfo-9.0.0-py3-none-any.whl.metadata (794 bytes)
Collecting blosc2>=2.3.0 (from tables)
  Downloading blosc2-3.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.8 kB)
Collecting ndindex (from blosc2>=2.3.0->tables)
  Downloading ndindex-1.9.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.4 kB)
Downloading tables-3.10.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.5 MB)
Downloading blosc2-3.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.4 MB)

In [None]:
import numpy as np
import os
import time
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import h5py
import multiprocessing
import joblib

from IPython.display import clear_output
from sklearn.neural_network import MLPClassifier

In [None]:
# Connect to Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Load data

In [None]:
path_saveCSV = '/content/drive/MyDrive/UIS/Doctorado_UIS2198589/1_semestre/TopicosAvanzadosGeofisica/FC_CarolinaBais/Dataset_CSV'

df = pd.read_hdf(os.path.join(path_saveCSV, 'TRAIN_CarolinaBays_AOI_01_03.h5'), 'df')

df.head()

Unnamed: 0,B1,B2,B3,B4,B5,B6,B7,B2_B1,B3_B1,B4_B1,...,B5_B3,B6_B3,B7_B3,B5_B4,B6_B4,B7_B4,B6_B5,B7_B5,B7_B6,y
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# Total number of rows: 103327744
df.info()


<class 'pandas.core.frame.DataFrame'>
Index: 103327744 entries, 0 to 103327743
Data columns (total 29 columns):
 #   Column  Dtype  
---  ------  -----  
 0   B1      float32
 1   B2      float32
 2   B3      float32
 3   B4      float32
 4   B5      float32
 5   B6      float32
 6   B7      float32
 7   B2_B1   float32
 8   B3_B1   float32
 9   B4_B1   float32
 10  B5_B1   float32
 11  B6_B1   float32
 12  B7_B1   float32
 13  B3_B2   float32
 14  B4_B2   float32
 15  B5_B2   float32
 16  B6_B2   float32
 17  B7_B2   float32
 18  B4_B3   float32
 19  B5_B3   float32
 20  B6_B3   float32
 21  B7_B3   float32
 22  B5_B4   float32
 23  B6_B4   float32
 24  B7_B4   float32
 25  B6_B5   float32
 26  B7_B5   float32
 27  B7_B6   float32
 28  y       float32
dtypes: float32(29)
memory usage: 11.9 GB
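The reported memory usage can be reproduced with simple arithmetic. The sketch below assumes an 8-byte integer index and that pandas divides by 1024 per unit when formatting the size; both are assumptions, not something the output states.

```python
n_rows = 103_327_744   # rows reported by df.info()
n_cols = 29            # 28 features plus the label column 'y'
bytes_per_value = 4    # float32
index_bytes = 8        # assumption: the index is stored as int64

data_bytes = n_rows * n_cols * bytes_per_value
total_bytes = data_bytes + n_rows * index_bytes

total_gib = total_bytes / 2**30
print(f"{total_gib:.1f} GiB")  # prints 11.9 GiB
```

At this size, every full copy of the DataFrame (for example a `.copy()` of one category) adds a comparable amount of memory, which is worth keeping in mind for the training loop.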


# Split and training

In [None]:
num_cores = multiprocessing.cpu_count()
print(f"Number of available cores: {num_cores}")


Number of available cores: 96


In [None]:
path_save_models = '/content/drive/MyDrive/UIS/Doctorado_UIS2198589/1_semestre/TopicosAvanzadosGeofisica/FC_CarolinaBais/ML_models/'


- After **20 minutes**, only **2 iterations** had completed, so the run was stopped to be continued later. To shorten training time, the **batch size will be changed** from the default of **200** samples to a fraction of the training set (`X_train.shape[0] / 100`), which gives fewer, larger batches per epoch. With the default, the second iteration took more than 20 minutes to appear.  

- With the training set divided into **100 batches**, **3 iterations** completed in **10 minutes**. Dividing it into **50 batches** was then tested to see whether training time improved. The process appeared faster, but validation performance did not improve significantly, so the setting remains at **100 batches**.  

- Additionally, to save more time, the **first layer will be removed**, so the pyramid structure starts from **28 neurons**. So far, the scores have remained around **0.7**.  

- With this setup, **4 iterations were completed in 10 minutes**, showing a **gradual improvement in validation** while the training score decreased slightly. To evaluate the impact of the first layer, it will be reintroduced with **28 neurons** and tested again to check whether the training time remains similar.  

- After testing, **only 2 iterations were completed in 10 minutes**, and the scores behaved similarly. Since no significant improvement was observed, **the added input layer will be removed again**.  

- The **loss** always started at about **0.7** in all models and decreased **very slowly**.

- The first model took approximately 1 hour and 20 minutes. The tolerance was then adjusted to 1e-3 (0.001) so that training stops earlier if the validation score does not improve significantly.
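The effect of the batch settings discussed in these notes can be estimated with a little arithmetic. The sketch below is only an approximation under stated assumptions: it assumes the validation rows are held out before batching and that a final partial batch counts as one weight update. `batches_per_epoch` is a hypothetical helper, and `n_samples` is the training size printed in this notebook's run log.

```python
import math

n_samples = 18_602_534        # training rows printed in the run log
validation_fraction = 0.2     # value passed to MLPClassifier

# Assumption: the validation rows are removed before the batches are formed
n_val = math.ceil(validation_fraction * n_samples)
n_fit = n_samples - n_val

def batches_per_epoch(batch_size):
    """Number of weight updates per epoch, counting a final partial batch."""
    return math.ceil(n_fit / batch_size)

default_bs = min(200, n_samples)   # scikit-learn's batch_size='auto'
bs_100 = int(n_samples / 100)      # setting used in the training cell
bs_50 = int(n_samples / 50)        # alternative tried in the notes

for name, bs in [("default", default_bs), ("n/100", bs_100), ("n/50", bs_50)]:
    print(f"{name:>7}: batch_size={bs:>7}, updates per epoch={batches_per_epoch(bs)}")
```

Under these assumptions the default performs tens of thousands of small updates per epoch, while the `n/100` setting performs fewer than a hundred large ones, which is consistent with the much shorter time per iteration reported above.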



In [None]:
print_text = True
print_text_Training = True
verbose_print = True

count_models = 9 # Indicate the model number that will be saved
# If the process was stopped and you want to continue from the previous amount,
# you should specify the number from which you want to start

total_models = count_models + 2

target_score = 0.6 # In the tests, it never exceeded a score of 0.6

count_trial = 1

train_score_list = []
test_score_list = []
models_name_list = []
elapsed_time_list = []
trial_list = []


while count_models < total_models:
  # Start the timer
  start_time = time.time() # The two runs recorded in the results table took about 23 and 37 minutes

  clear_output(wait=True)

  # Create empty DataFrames for train and test
  df_train = pd.DataFrame()
  df_test = pd.DataFrame()

  # Filter data
  category_data_1 = df[df['y'] == 1].copy().reset_index(drop=True)
  category_data_0 = df[df['y'] == 0].copy().reset_index(drop=True)

  # Calculate 80% for train and 20% for test
  train_size = int(0.8 * len(category_data_1))
  test_size = len(category_data_1) - train_size

  # Select randomly to shuffle the data
  category_data_1 = category_data_1.sample(frac=1).reset_index(drop=True)
  category_data_0 = category_data_0.sample(frac=1).reset_index(drop=True)

  # Split into train and test
  category_train_1 = category_data_1[:train_size]
  category_train_0 = category_data_0[:train_size]
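  category_test_1 = category_data_1[train_size:]
  # Note: this keeps every remaining category 0 row, so the test set is not
  # balanced (compare 'Test size' and 'DF Test size' in the output).
  # For a balanced test set, use
  # category_data_0[train_size:train_size + test_size] instead.
  category_test_0 = category_data_0[train_size:]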

  category_train = pd.concat([category_train_1, category_train_0], ignore_index=True)
  category_train = category_train.sample(frac=1).reset_index(drop=True)
  category_test = pd.concat([category_test_1, category_test_0], ignore_index=True)
  category_test = category_test.sample(frac=1).reset_index(drop=True)

  if print_text:
    print(f'Train size: {len(category_train_1)*2}')
    print(f'Test size: {len(category_test_1)*2}')
    print('---'*3)

  # Concatenate the data into the corresponding DataFrames
  df_train = pd.concat([df_train, category_train], ignore_index=True)
  df_test = pd.concat([df_test, category_test], ignore_index=True)
  if print_text:
    print(f'DF Train size: {df_train.shape[0]}')
    print(f'DF Test size: {df_test.shape[0]}')
    print('\n')

  # Data for train and test
  X_train = df_train.iloc[:, :-1].to_numpy()
  y_train = df_train['y'].to_numpy()

  X_test = df_test.iloc[:, :-1].to_numpy()
  y_test = df_test['y'].to_numpy()

  if print_text:
    print('Shapes X_train, y_train, X_test, y_test')
    print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

  # Create the model
  hidden_layers = [28, 14, 7, 3]
  model_MLPClassifier = MLPClassifier(hidden_layer_sizes=tuple(hidden_layers),
                                      activation='relu',
                                      verbose=verbose_print,
                                      solver='adam',
                                      batch_size= int(X_train.shape[0]/100),
                                      max_iter=100,
                                      learning_rate= 'adaptive', #'adaptive', 'constant',
                                      learning_rate_init=0.001,
                                      tol= 1e-3,#1e-5,
                                      early_stopping=True,
                                      shuffle=True,
                                      n_iter_no_change= 5,#10,
                                      validation_fraction=0.2)

  # Train the model
  print('---'*10)
  print(f'Trial: {count_trial}')
  model_MLPClassifier.fit(X_train, y_train)

  # End the timer
  end_time = time.time()

  # Calculate the elapsed time
  elapsed_time = end_time - start_time

  train_score = model_MLPClassifier.score(X_train, y_train)
  test_score = model_MLPClassifier.score(X_test, y_test)

  if train_score > target_score:
    if print_text_Training:
      print(f'Train score: {train_score:.4f}')
      print(f'Test score: {test_score:.4f}')
      print(f'Elapsed time: {elapsed_time:.2f} seconds')
      print('\n')

    # Save model
    # Zero-pad the model number to three digits (e.g. 009, 010)
    Name = f'model_MLPClassifier_{count_models:03d}.pkl'

    joblib.dump(model_MLPClassifier, path_save_models + Name)
    print(f'---> Model saved as {Name}')
    print('\n')

    train_score_list.append(train_score)
    test_score_list.append(test_score)
    models_name_list.append(Name)
    elapsed_time_list.append(round(elapsed_time, 2))
    trial_list.append(count_trial)

    count_models += 1

  else:
    print(f'Train score: {train_score:.4f}')
    print(f'Elapsed time: {elapsed_time:.2f} seconds')
    print('No model was generated.')
    print('\n')

  count_trial += 1


Train size: 18602534
Test size: 4650634
---------
DF Train size: 18602534
DF Test size: 84725210


Shapes X_train, y_train, X_test, y_test
(18602534, 28) (18602534,) (84725210, 28) (84725210,)
------------------------------
Trial: 2
Iteration 1, loss = 0.68537879
Validation score: 0.592096
Iteration 2, loss = 0.67050364
Validation score: 0.601841
Iteration 3, loss = 0.65538055
Validation score: 0.607642
Iteration 4, loss = 0.64531297
Validation score: 0.607989
Iteration 5, loss = 0.64125853
Validation score: 0.627873
Iteration 6, loss = 0.63305932
Validation score: 0.634538
Iteration 7, loss = 0.62480806
Validation score: 0.642793
Iteration 8, loss = 0.61705932
Validation score: 0.617670
Iteration 9, loss = 0.61597833
Validation score: 0.658836
Iteration 10, loss = 0.60726972
Validation score: 0.630801
Iteration 11, loss = 0.60598977
Validation score: 0.661974
Iteration 12, loss = 0.60030458
Validation score: 0.661037
Iteration 13, loss = 0.59746357
Validation score: 0.664817
Iteration
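
The validation scores in the log above can be used to illustrate how `tol` and `n_iter_no_change` interact. The sketch below implements a simplified patience rule (an epoch counts as stalled when its score is below the best score so far plus `tol`). It is an assumption-laden illustration, and scikit-learn's exact bookkeeping may differ; `stop_epoch` is a hypothetical helper.

```python
def stop_epoch(val_scores, tol=1e-3, patience=5):
    """Return the epoch at which a simple patience rule would stop, or None."""
    best = float("-inf")
    stalled = 0
    for epoch, score in enumerate(val_scores, start=1):
        # An epoch is "stalled" if it does not beat the best score by at least tol
        if score < best + tol:
            stalled += 1
        else:
            stalled = 0
        best = max(best, score)
        if stalled >= patience:
            return epoch
    return None

# Validation scores copied from the first 13 iterations of the log
logged = [0.592096, 0.601841, 0.607642, 0.607989, 0.627873, 0.634538,
          0.642793, 0.617670, 0.658836, 0.630801, 0.661974, 0.661037,
          0.664817]

print(stop_epoch(logged))      # None: no run of 5 stalled epochs in the log
print(stop_epoch([0.5] * 7))   # 6: the score plateaus after the first epoch
```

In the logged run the validation score keeps setting new highs every one or two epochs, so a rule of this kind would not trigger within the 13 iterations shown.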

# Df models

In [None]:
dict_model = {'Trial': trial_list,
              'Model': models_name_list,
              'Train score': train_score_list,
              'Test score': test_score_list,
              'Elapsed time': elapsed_time_list} # Total time per iteration

df_models = pd.DataFrame(dict_model)

df_models

Unnamed: 0,Trial,Model,Train score,Test score,Elapsed time
0,1,model_MLPClassifier_009.pkl,0.641587,0.684924,1375.63
1,2,model_MLPClassifier_010.pkl,0.665266,0.629175,2211.04


In [None]:
df_models.describe().round(2)

Unnamed: 0,Trial,Train score,Test score,Elapsed time
count,2.0,2.0,2.0,2.0
mean,1.5,0.65,0.66,1793.34
std,0.71,0.02,0.04,590.72
min,1.0,0.64,0.63,1375.63
25%,1.25,0.65,0.64,1584.48
50%,1.5,0.65,0.66,1793.34
75%,1.75,0.66,0.67,2002.19
max,2.0,0.67,0.68,2211.04


## Save Df

In [None]:
df_models.to_csv(path_save_models + 'df_30model_MLPClassifier.csv',
                 sep=';',
                 decimal=',',
                 index=False)
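
Because the file is written with `sep=';'` and `decimal=','`, it has to be read back with the same options, otherwise the scores are parsed as text. A minimal round-trip sketch, using an in-memory buffer instead of the Drive path and a single row taken from the table above:

```python
import io
import pandas as pd

# One row copied from the results table, for illustration only
df_demo = pd.DataFrame({"Model": ["model_MLPClassifier_009.pkl"],
                        "Train score": [0.641587],
                        "Elapsed time": [1375.63]})

# Write with the same options as the notebook
buffer = io.StringIO()
df_demo.to_csv(buffer, sep=";", decimal=",", index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read back with matching separator and decimal mark
df_back = pd.read_csv(io.StringIO(csv_text), sep=";", decimal=",")
print(df_back.dtypes)
```

This format opens cleanly in spreadsheet software configured for locales that use a decimal comma, which is a plausible reason for choosing it.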

# End