# PSBC - grid search notebook

Goal: evaluate the PSBC model on the parameter grids stored in the folder "../Grids", inside the main folder.

How: for the "vary_Nt" search, the notebook creates one subfolder per value of Nt in {1, 2, 4} and evaluates the model in each of them.

Input:
* Neumann: bool (True or False), selects the type of boundary condition.
* subordinate: bool (True or False, default is True), selects the type of model.
* parallel: bool (True or False, default is False), whether the grid search runs in parallel.
* with_PCA: bool (True or False, default is False), whether a different basis matrix is used.
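
As a compact way to read the inputs above, the sketch below groups the four flags into one record with the stated defaults and maps each folder name handled by the settings cell of this notebook to its flag values. The names `GridSearchFlags` and `FLAGS_BY_FOLDER` are hypothetical and not part of the PSBC libraries; the notebook itself sets these variables with plain `if`/`elif` branches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GridSearchFlags:
    """Hypothetical container for the four boolean inputs of this notebook."""
    Neumann: bool
    subordinate: bool = True   # default stated in the input list
    parallel: bool = False     # default stated in the input list
    with_PCA: bool = False     # default stated in the input list


# Folder names and values mirror the branches of the settings cell.
FLAGS_BY_FOLDER = {
    "Neumann": GridSearchFlags(Neumann=True),
    "Periodic": GridSearchFlags(Neumann=False),
    "Neumann_non_subordinate": GridSearchFlags(Neumann=True, subordinate=False),
    "PCA_196": GridSearchFlags(Neumann=True, with_PCA=True),
    "Classifier_196": GridSearchFlags(Neumann=True),
}

for folder, flags in FLAGS_BY_FOLDER.items():
    print(folder, flags)
```

A lookup table like this makes an unknown folder name fail with a `KeyError` straight away, instead of leaving the flags undefined.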


First, when running on Colab, we mount Google Drive so that the notebook can access the folders with the files.

In [1]:
Colab = False #True

In [2]:
if Colab:
    from google.colab import drive
    drive.mount ('/content/drive')

Then we import the libraries we will need.

In [3]:
#import  matplotlib.pyplot as plt
import scipy.sparse as sc
import itertools as it
import pandas as pd
import numpy as np
import warnings
import shutil 
import copy
import glob
import sys
import os
import time
import tensorflow as tf
from tensorflow import keras
try: ## pickle is used to open and save dictionaries
    import cPickle as pickle
except ImportError:  # python 3.x
    import pickle

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
import multiprocess as mp
warnings.filterwarnings(action = "ignore", message = "internal issue")

At this moment we are in the folder


In [4]:
folder_now = os.getcwd ()
print (folder_now)

/home/rafa-monteiro/Desktop/2021/Research/Phase_separation_tensorflow/PSBC


We then move to the folder we need and add the folder containing the libraries we will use to the import path.

In [8]:
if Colab: 
    os.chdir ("/content/drive/MyDrive/PSBC/")

sys.path.insert (0, "MOTHER_PSBC/")
folder_now = os.getcwd ()
print (folder_now)

/home/rafa-monteiro/Desktop/2021/Research/Phase_separation_tensorflow/Parameter_search


The libraries we import are

In [9]:
from tfversion_binary_phase_separation import *
from tf_PSBC_extra_libs_for_training_and_grid_search import *

Now we access the appropriate folder

In [13]:
print ("Folder options are 'Neumann', 'Periodic', 'PCA_196', 'Classifier_196'")

which_folder = "Neumann"

# Create the folder only if it does not exist yet
os.makedirs (which_folder, exist_ok = True)

os.chdir (which_folder)
print (os.getcwd ())

Folder options are 'Neumann', 'Periodic', 'PCA_196', 'Classifier_196'
/home/rafa-monteiro/Desktop/2021/Research/Phase_separation_tensorflow/Parameter_search/Neumann
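
The folder step above has to be safe to rerun, since the cell may be executed several times in the same session. A minimal sketch of that pattern, run inside a temporary directory so that nothing is left behind; the helper name `ensure_folder` is hypothetical:

```python
import os
import tempfile


def ensure_folder(base, name):
    """Create base/name if it is missing and return its absolute path."""
    path = os.path.join(base, name)
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    return os.path.abspath(path)


with tempfile.TemporaryDirectory() as base:
    first = ensure_folder(base, "Neumann")
    second = ensure_folder(base, "Neumann")  # second call is a no-op
    same_path = first == second
    exists = os.path.isdir(first)

print("same path:", same_path, "| folder existed:", exists)
```

Unlike a bare `try`/`except`, `exist_ok = True` only tolerates the "already exists" case, so a permission problem or a wrong path still raises an error.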


## Setting the parameters.

In [1]:
if which_folder in ['Neumann', 'Periodic']:
    # Recall that grid search happens for eps = 0, hence both models 
    # are the same, because no diffusion is in place
    Neumann = which_folder == 'Neumann'
    subordinate = True
    with_PCA = False     ### Note: bool ("False") evaluates to True
    parallel = False
    cpu = 4  ## In case of parallel processing
    Nx = 784
elif which_folder == 'Neumann_non_subordinate':
    Neumann = True
    subordinate = False
    with_PCA = False     ### Note: bool ("False") evaluates to True
    parallel = False
    cpu = 4  ## In case of parallel processing
    Nx = 784
elif which_folder == 'PCA_196':
    Neumann = True
    subordinate = True
    with_PCA = True     ### Note: bool ("False") evaluates to True
    parallel = False
    cpu = 4  ## In case of parallel processing
    Nx = 784
elif which_folder == 'Classifier_196':
    Neumann = True
    Nt = 2
    save_history = True
    Nx = 784
    subordinate = True
    ###--------- INPUT -------------------------------------------------------------
    ### READ VARIABLES AND RETRIEVE TRAINING DATA (BOTH VARIABLES COMBINED)
    with_PCA = False
    cpu = int (4)
    parallel = False

grid_type = "grid_search" #"training"#
    
print ("The model will perform \n", grid_type,
       "\nwith the following parameters:\n* Neumann is",
       Neumann, "\n* with_PCA is", with_PCA,
       "\n* subordinate is", subordinate,"\n* parallel is", parallel)


NameError: name 'which_folder' is not defined
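
The comments in the settings cell point out that `bool ("False")` is `True`, because any non-empty string is truthy in Python. If the flags were ever read as text (an assumption; in this notebook they are set directly as booleans), a small parser such as the hypothetical `parse_bool` below would avoid that trap:

```python
def parse_bool(text):
    """Convert 'True'/'False' style strings to a bool; reject anything else."""
    normalized = str(text).strip().lower()
    if normalized in ("true", "1", "yes"):
        return True
    if normalized in ("false", "0", "no"):
        return False
    raise ValueError(f"Cannot interpret {text!r} as a boolean")


naive = bool("False")         # True: a non-empty string is always truthy
parsed = parse_bool("False")  # False: the text is actually inspected

print("bool('False') ->", naive)
print("parse_bool('False') ->", parsed)
```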

## Computations

In [None]:
if which_folder == 'PCA_196':
    nt_range  = [2]
    digits_range = [0,1]  ## Indexes into pairs_of_digits
    pairs_of_digits = [(4,9), (3, 5)]    
elif which_folder == 'Classifier_196':
    nt_range  = [2]
    digits_range = np.arange (0,45)
else:
    nt_range = [1,2,4]
    digits_range = [0]    

{'EPOCHS': 10,
 'Neumann': True,
 'Nt': 4,
 'cv': 5,
 'patience': 10,
 'train_dt_P': True,
 'train_dt_U': True}
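
The dictionary printed above has the same keys that the grid search loop further down reads from `grid_range`. As a sketch of how such a dictionary could be stored and retrieved with the file naming used in that loop (`"grid_search_" + str (Neumann) + "_" + str (Nt) + ".p"`), the example below does a round trip in a temporary folder. The helper `grid_filename` is hypothetical, and the temporary folder only stands in for the "Grids" folder.

```python
import os
import pickle
import tempfile

# Same keys and values as the dictionary printed above.
grid_range = {
    "EPOCHS": 10, "Neumann": True, "Nt": 4, "cv": 5,
    "patience": 10, "train_dt_P": True, "train_dt_U": True,
}


def grid_filename(Neumann, Nt):
    """Build the file name following the convention used in the loop."""
    return "grid_search_" + str(Neumann) + "_" + str(Nt) + ".p"


with tempfile.TemporaryDirectory() as grids_folder:
    path = os.path.join(grids_folder, grid_filename(True, 4))
    with open(path, "wb") as handle:
        pickle.dump(grid_range, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, "rb") as handle:
        loaded = pickle.load(handle)

# The same consistency checks the loop performs after loading.
assert loaded["Nt"] == 4
assert loaded["Neumann"] is True
print(os.path.basename(path), loaded)
```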

> **Remark**: if you run this model in Colab, it is best to use TPUs or GPUs to speed up the grid search. In this case it is also convenient to split the processing into cases, running one Nt folder at a time or, for classifiers, splitting digits_range into pieces.
In general, each batch evaluation runs quickly, and setting EPOCHS larger than 10 turned out to be overkill. You can change that if you want.
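
Splitting `digits_range` into pieces, as suggested in the remark, can be sketched as follows. The helper `split_into_pieces` and the choice of 4 pieces are hypothetical; each piece could then be assigned to `digits_range` in a separate session.

```python
def split_into_pieces(indexes, n_pieces):
    """Split indexes into n_pieces consecutive chunks of near-equal size."""
    indexes = list(indexes)
    size, extra = divmod(len(indexes), n_pieces)
    pieces, start = [], 0
    for k in range(n_pieces):
        # The first `extra` pieces receive one additional element.
        stop = start + size + (1 if k < extra else 0)
        pieces.append(indexes[start:stop])
        start = stop
    return pieces


# 45 indexes, as in the 'Classifier_196' case, split into 4 runs.
pieces = split_into_pieces(range(45), 4)
for k, piece in enumerate(pieces):
    print("run", k, ": indexes", piece[0], "to", piece[-1])
```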

In [None]:
print ("\nGrid Search - ", which_folder)
for index in digits_range:
    for Nt in nt_range:
        # Create the folder for this Nt only if it does not exist yet
        os.makedirs (str (Nt), exist_ok = True)
        
        os.chdir (str (Nt))

        if which_folder in ['PCA_196', 'Classifier_196']:
            parameters_model = create_grid_for_search (Nt, "classifier")
        else:
            parameters_model = create_grid_for_search (Nt, "vary_Nt")

        ### RETRIEVE PARAMETERS
        filename = "grid_search_" + str (Neumann)+ "_" + str (Nt) + ".p"
        with open ("../../Grids/" + filename, 'rb') as pickled_dic:
            grid_range  = pickle.load (pickled_dic)

        with open ("../../Grids/digits_index.p", 'rb') as pickled_dic:
            grid_indexes  = pickle.load (pickled_dic)

        print (parameters_model)
        cv = grid_range ["cv"]

        ############################################
        print ("Asserting Nt")
        assert (grid_range["Nt"] == Nt)
        print ("Asserting Neumann")
        assert (grid_range["Neumann"] == Neumann)
        print (grid_range)
        ############################################

        EPOCHS = grid_range ["EPOCHS"]
        patience = grid_range ["patience"]
        Nt = grid_range ["Nt"]
        train_dt_U = grid_range ["train_dt_U"]
        train_dt_P = grid_range ["train_dt_P"]

        ###-----------------------------------------------------------------------------
        ### READ VARIABLES AND RETRIEVE TRAINING DATA (BOTH VARIABLES COMBINED)
        if with_PCA:
            variable_0, variable_1 = pairs_of_digits [index]
        else:
            variable_0, variable_1 = grid_indexes [index] 

        print ("\n* Number of cross validations :", cv)
        print ("Variables given:\n\tvariable_0 :",\
              variable_0,"\n\tvariable_1 :", variable_1)
        print ("\n* Parallel is", parallel,\
               ". (If parallel is True, then use ", cpu," cores.)")
        print ("\n* Nx :", Nx, ", Neumann :", Neumann,\
               ", Epochs : ", EPOCHS, ", Patience : ", patience)
        print ("\n* Nt :",  Nt, ", train_dt_U :",\
               train_dt_U, ", train_dt_P :", train_dt_P)
        print ("\n* with_PCA :", with_PCA)

        ###-----------------------------------------------------------------------------
    
        ## retrieve non-shuffled data
        S = select_split_pickle (level= 2)
        X_all, Y_all, _ = S.select_variables_from_pickle (variable_0, variable_1)

        ###  RETRIEVE TRAIN-TEST INDEXES and 
        file_name = "../../Pickled_datasets/generate_k_fold_" +\
        str (variable_0) + "_" + str (variable_1) + ".p"

        with open (file_name, 'rb') as pickled_dic:
            generate_k_fold = pickle.load (pickled_dic)

        results = []      
        #####################################################################
        ### BEGIN PARALLEL PROCESSING
        if parallel:
            print ("\nRUNNING THE MODEL IN PARALLEL")
            a = time.time ()
            pool = mp.Pool(cpu)
            for i in range(cv):
                ### Normalized and centralized (mean zero)
                train_index, test_index,\
                  mean_train_grid, Vstar, var_0_pickled, var_1_pickled =\
                generate_k_fold [str (i)]

                assert (variable_0 == var_0_pickled)
                assert (variable_1 == var_1_pickled)

                ### Split
                X_train_grid, Y_train_grid = X_all[train_index], Y_all[train_index]
                ### Centralization 
                #mean_train_grid = np.mean (X_train_grid, axis = 0)
                X_train_grid = X_train_grid - mean_train_grid
                X_test_grid, Y_test_grid =\
                X_all [test_index] - mean_train_grid, Y_all [test_index]

                ### Now run grid search IN PARALLEL
                print ("\nUsing", cpu, "processors")
                # Step 3: Use loop to parallelize
                args_now = (
                    i, X_train_grid, Y_train_grid, X_test_grid,\
                    Y_test_grid, parameters_model,\
                    Nx, Neumann, EPOCHS , patience, Nt,\
                    train_dt_U, train_dt_P, with_PCA, Vstar
                )
                results.append(
                    pool.apply_async(my_gridSearch_with_index, args = args_now ))
            
            # results is a list of pool.ApplyResult objects
            all_results = [r.get () for r in results]
            pool.close ()
            pool.join ()
            print ("\n It took", time.time () -a, "to run the model in parallel")
        else:
            print ("\nRUNNING THE MODEL SERIALLY")
            a = time.time ()
            for i in range(cv):
                ### Normalized and centralized (mean zero)
                train_index, test_index,\
                  mean_train_grid, Vstar, var_0_pickled, var_1_pickled =\
                generate_k_fold [str (i)]

                assert (variable_0 == var_0_pickled)
                assert (variable_1 == var_1_pickled)

                ### Split
                X_train_grid, Y_train_grid = X_all [train_index], Y_all [train_index]
                ### Centralization 
                X_train_grid = X_train_grid - mean_train_grid
                X_test_grid, Y_test_grid = X_all [test_index] - mean_train_grid, Y_all [test_index]

                ### Now run grid search SERIALLY (one fold at a time)
                results.append (
                    my_gridSearch_with_index (
                        i, X_train_grid, Y_train_grid,\
                        X_test_grid, Y_test_grid,\
                        parameters_model, Nx, Neumann, EPOCHS,\
                        patience, Nt, train_dt_U, train_dt_P, with_PCA,\
                        Vstar
                    )
                )
            
            # results already holds the tuples returned by my_gridSearch_with_index
            all_results = results
            print ("\n It took", time.time () -a, "to run the model serially")

        #return results
        for j, a, b in all_results:
            if  j == 0:
                Accuracies, Parameters = a, b
            else:
                Accuracies_tmp, Parameters_tmp = a, b
                assert (Parameters_tmp == Parameters)
                Parameters = Parameters_tmp
                Accuracies = np.vstack ([Accuracies, Accuracies_tmp]) 

        os.makedirs ("grid_search", exist_ok = True)
            
        print ("Creating Accuracies and parameter pickled file")
        if with_PCA:
            file_name = "grid_search/PCA_all_grid_search_results_"\
              +str (variable_0) + "_" + str (variable_1) + ".p"
            with open (file_name, 'wb') as save:
                pickle.dump ( (Accuracies, Parameters), save,\
                             protocol = pickle.HIGHEST_PROTOCOL)        
            print ("Statistics pickled to ", file_name)
        else:
            file_name = "grid_search/Normal_all_grid_search_results"\
            + str (variable_0) + "_" + str (variable_1) + ".p"
            with open (file_name, 'wb') as save:
                pickle.dump ((Accuracies, Parameters), save,\
                             protocol = pickle.HIGHEST_PROTOCOL)        
                print ("Statistics pickled to ", file_name)

        os.chdir ("../")

Streaming output truncated to the last 5000 lines.
Epoch 6/10


Accuracy on the validation data 0.8945914 

Epoch 7/10


Accuracy on the validation data 0.8941966 

Epoch 8/10


Accuracy on the validation data 0.89340705 

Epoch 9/10


Accuracy on the validation data 0.89340705 

Epoch 10/10


Accuracy on the validation data 0.89340705 


End of training

Saving validation data accuracy data
[0.5282274, 0.95578367, 0.9249901, 0.91077775, 0.9016976, 0.8965653, 0.8945914, 0.8941966, 0.89340705, 0.89340705, 0.89340705]
Maximal accuracy was 0.95578367

Varying parameters time 
 28
eps : 0.0 	dt : 0.2 	ptt_cardnlty : 392 	layers_K_shared : 4 	lr_U : 0.001 	lr_P : 0.01
Setting up a U layer with Neumann B.C.s.
Setting up a subordinate model with phase
Not saving best model

filepath weights/0_0_392_4_0_1_fold_4
Validation data available
Epoch 1/10
Training dt_U
Training dt_P
Training dt_U
Training dt_P


Accuracy on the validation data 0.8033952 

Epoch 2/10


Accuracy on the va