# Introduction

As a beginner, I made this notebook to present a generic approach to "play" with the concepts of machine learning and neural networks. I have also tried to provide some clean Python code. In short, it is a synthesis of my current knowledge (and sorry for my English).

This is a first version, which will be improved competition after competition.

The basic steps to define a 'Generic approach of Machine Learning' are:
1. Define the problem: understand the data you have, and define what the inputs (attributes) and the output (target) of your Machine Learning are
2. Summarize the dataset content in a statistical form
3. Prepare the dataset for your Machine Learning processing
4. Evaluate a set of algorithms based on your understanding of the data
5. Improve the results of your Machine Learning by refining the algorithms
6. Present the results of your Machine Learning
7. Deploy or save your Machine Learning
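
As a minimal sketch of step 1, assuming a tiny in-memory sample in the Iris CSV layout (the three rows and the shortened column names below are illustrative, not the real dataset), the attributes and the target could be separated like this:

```python
import io
import pandas as pd

# Hypothetical three-row sample in the Iris CSV layout (no header row).
SAMPLE_CSV = """5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

# Illustrative column labels; the last one is the target.
columns_label = ['sepal length', 'sepal width', 'petal length', 'petal width', 'class']
target_column = 'class'

df = pd.read_csv(io.StringIO(SAMPLE_CSV), header=None, names=columns_label)

# Inputs (attributes): every column except the target.
X = df.drop(columns=[target_column])
# Output (target): the single column to predict.
y = df[target_column]

print(X.shape)
print(y.tolist())
```

The same split applies to a regression problem; only the type of the target column changes.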

**NOTE: Please, feel free to correct and enhance this notebook ;)**

Before starting the notebook, we have to upgrade the Python installer (pip) and install some additional modules.

To define the problem, we first have to choose the subject we will work on. Point 1.b provides different datasets we can play with. For each dataset, a comment describes the problem to address. 
We will consider two different problems:
1. One about classification (the basic one is the Iris classification)
2. One about regression (Melbourne housing prices)

This is the part that cannot be generic. The generic behavior proposed here is parameterized by the set of parameters defined in point b.1.

Note: In point b.1, to switch to another problem, just comment out the current one and uncomment the problem to play with.

Switching to Python code, the first step is to load all the required libraries (1.a) and to choose the problem to solve, let's say Iris classification or Melbourne House Prices regression.

In [None]:
from __future__ import division # True division (1/4=0.25) instead of floor division (1/4=0); has no effect on Python 3

# 1. Prepare Problem

# a) Load libraries
import os, warnings, argparse, io, operator, requests, math, random, tempfile
import re # Regular expressions support
from datetime import datetime # Date & Time support

import numpy as np # Linear algebra
import matplotlib.pyplot as plt # Data visualization
import seaborn as sns # Enhanced data visualization
#import seaborn_image as isns # Enhanced image data visualization
import pandas as pd # Data processing, CSV file I/O (e.g. pd.read_csv)

from pandas_profiling import ProfileReport

import sklearn
from sklearn import model_selection
from sklearn import linear_model # Regression
from sklearn import discriminant_analysis
from sklearn import neighbors # Nearest neighbors (e.g. k-NN)
from sklearn import naive_bayes
from sklearn import tree # Decision tree learning
from sklearn import svm # Support Vector Machines
from sklearn import ensemble # Support RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor, AdaBoostRegressor

import xgboost as xgb # Gradient Boosted Decision Trees algorithm

import lightgbm as lgb # Light Gradient Boost model

from sklearn.base import is_classifier, is_regressor # To check if the model is for regression or classification (does not work for Keras)

from sklearn.impute import SimpleImputer 

from sklearn.preprocessing import LabelEncoder # Labelling categorical feature from 0 to N_class - 1('object' type)
from sklearn.preprocessing import LabelBinarizer # Binary labelling of categorical feature
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MultiLabelBinarizer

from sklearn.preprocessing import StandardScaler # Standardization (zero mean, unit variance)
from sklearn.preprocessing import MinMaxScaler # Scaling to a given range ([0, 1] by default)
from sklearn.preprocessing import MaxAbsScaler # Scaling by the maximum absolute value

from sklearn.pipeline import Pipeline

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import r2_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

from sklearn.model_selection import GridSearchCV

from sklearn.feature_selection import mutual_info_regression, mutual_info_classif # To build Mutual Information plots

from sklearn.inspection import permutation_importance

import pickle # Use to save and load models

import eli5
from eli5.sklearn import PermutationImportance

# Neural Network
import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.callbacks import ModelCheckpoint
import keras
from keras.wrappers.scikit_learn import KerasClassifier, KerasRegressor
from keras.preprocessing.image import ImageDataGenerator, load_img, img_to_array, array_to_img

from PIL import Image

First of all, we have to define the problem:
1. Understand the data, see point b.1) below
2. Prepare the basics of your code such as loading the libraries and your data, see points a) and b) below

In point b.1, we have a set of parameters strongly linked to the problem to solve. These parameters are used to configure the execution of 'Generic approach of Machine Learning':
- ML_NAME: The name of the Machine Learning problem (e.g. Titanic, Pima or Iris)
- DATABASE_URI: The root directory of the datasets (e.g. tpu-getting-started for the Flower Classification with TPUs competition)
- DATABASE_NAME: The base name of the images database
- COLUMNS_LABEL: The column labels of the dataset. Default: None, meaning that the labels are already present in the loaded dataset
- TARGET_COLUMNS: The output (target) column
- OUTPUT_IS_REGRESSION: Indicates whether the problem is a regression (True) or a classification (False)
- DATE_TIME_COLUMNS: The list of date/time columns stored in a custom format, such as strings
- EXCLUDE_FROM_OULIERS: The features to exclude from outliers processing
- NON_TRANSFORMABLE_COLUMNS: The list of columns which shall not be included in the transformation process (point 3.b)
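
As an alternative to commenting and uncommenting blocks of globals, the parameters listed above could be grouped per problem and selected by name. The sketch below is only one possible layout: `ProblemConfig`, `PROBLEMS` and `select_problem` are hypothetical names that do not exist in this notebook, and only a few of the parameters are shown.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class ProblemConfig:
    """Hypothetical container mirroring some of the global parameters."""
    ml_name: str
    target_columns: Optional[str]
    output_is_regression: bool
    database_uri: Optional[str] = None
    columns_to_drop: Optional[List[str]] = None

# Registry keyed by problem name; the values reuse settings from this notebook.
PROBLEMS = {
    'Iris': ProblemConfig(
        ml_name='Iris',
        target_columns='class',
        output_is_regression=False,
        database_uri='https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
    ),
    'MelbourneHousing': ProblemConfig(
        ml_name='MelbourneHousing',
        target_columns='Price',
        output_is_regression=True,
        columns_to_drop=['Address', 'Method'],  # Shortened list for the example
    ),
}

def select_problem(p_name: str) -> ProblemConfig:
    """Return the configuration of the requested problem (KeyError if unknown)."""
    return PROBLEMS[p_name]

config = select_problem('MelbourneHousing')
print(config.ml_name, config.target_columns, config.output_is_regression)
```

Switching to another problem then means changing a single string instead of moving triple quotes around.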

# Description of the different projects of my playground

In [None]:
# b) Helpers

# Set execution control flags
from enum import IntFlag
class ExecutionFlags(IntFlag):
    """
    This class provides some execution control flags to enable/disable some parts of the whole script execution
    """
    NONE                         = 0b00000000 # All flags disabled
    ALL                          = 0b11111111 # All flags enabled
    DATA_STAT_SUMMURIZE_FLAG     = 0b00000001 # Enable statistical analysis
    DATA_VISUALIZATION_FLAG      = 0b00000010 # Enable data visualization
    DATA_CLEANING_FLAG           = 0b00000100 # Enable data cleaning (feature engineering)
    DATA_TRANSFORM_FLAG          = 0b00001000 # Enable data transformation
    USE_NEURAL_NETWORK_FLAG      = 0b00010000 # Enable neural network models for Machine Learning
    USE_ONLY_NEURAL_NETWORK_FLAG = 0b00100000 # Use only neural network models for Machine Learning
    USE_CNN_NEURAL_NETWORK_FLAG  = 0b01000000 # Use neural network models for images based learning
                                              # This flag excludes all the others
    # End of class ExecutionFlags
    
# b.1) Define global parameters
# Regression

# Jan Tabular Playground Competition
"""
ML_NAME = 'JanTabularPlaygroundCompetition'
DATABASE_URI = None
COLUMNS_LABEL = None
COLUMNS_TO_DROP = ['id'] # Id is useless
TARGET_COLUMNS = 'target'
OUTPUT_IS_REGRESSION = True
DATE_TIME_COLUMNS = None
NON_TRANSFORMABLE_COLUMNS = None
"""
# To predict house price using the famous Melbourne housing dataset
"""
ML_NAME = 'MelbourneHousing'
DATABASE_URI = 'https://raw.githubusercontent.com/nagoya-foundation/r-functions-performance/master/data/Melbourne_housing_FULL.csv'
COLUMNS_LABEL = None
COLUMNS_TO_DROP = ['Address', 'Method', 'Postcode', 'CouncilArea', 'Propertycount', 'Regionname', 'SellerG', 'Suburb']
TARGET_COLUMNS = 'Price'
OUTPUT_IS_REGRESSION = True
DATE_TIME_COLUMNS = ['Date']
EXCLUDE_FROM_OULIERS = ['Lattitude', 'Longtitude']
NON_TRANSFORMABLE_COLUMNS = ['Lattitude', 'Longtitude']
# Suburb
# Address
# Rooms
# Type
# Price
# Method
# SellerG
# Date
# Distance
# Postcode
# Bedroom2
# Bathroom
# Car
# Landsize
# BuildingArea
# YearBuilt
# CouncilArea
# Lattitude
# Longtitude
# Regionname
# Propertycount
"""

# Classification

# To categorize an iris flower according to the dimensions of its sepals & petals 
"""
# Famous database; from Fisher, 1936
ML_NAME = 'Iris'
DATABASE_URI = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
COLUMNS_LABEL = ['sepal length in cm', 'sepal width in cm', 'petal length in cm', 'petal width in cm', 'class']
COLUMNS_TO_DROP = None
TARGET_COLUMNS = 'class'
OUTPUT_IS_REGRESSION = False
DATE_TIME_COLUMNS = None
EXCLUDE_FROM_OULIERS = None
NON_TRANSFORMABLE_COLUMNS = None
"""

# To predict survival on the Titanic
"""
ML_NAME = 'Titanic' # https://www.kaggle.com/c/titanic
DATABASE_URI = 'https://raw.githubusercontent.com/alexisperrier/packt-aml/master/ch4/titanic.csv'
TEST_FILE_NAME = '../input/titanic/test.csv'
COLUMNS_LABEL = None
COLUMNS_TO_DROP = ['PassengerId', 'Name', 'Ticket'] # PassengerId is useless, Name and Ticket will be processed in a future version
    # We assume that Name and Ticket are not relevant information
    # This can be confirmed by the correlation matrix
TARGET_COLUMNS = 'Survived'
OUTPUT_IS_REGRESSION = False
DATE_TIME_COLUMNS = None
EXCLUDE_FROM_OULIERS = None
NON_TRANSFORMABLE_COLUMNS = None
#  PassengerId: Unique passenger id
#  Survived: Survival status ('Yes' or 'No')
#  Pclass: The class the passenger belongs to (1st, 2nd or 3rd class)
#  Name: Name of the passenger
#  Sex: The sex of the passenger ('male' or 'female')
#  Age: The age of the passenger (in years)
#  SibSp: # of siblings / spouses aboard the Titanic
#  Parch: # of parents / children aboard the Titanic
#  Ticket: No description available for this field, perhaps the travel company identifier
#  Fare: Ticket price
#  Cabin: Identifier of the cabin. The first character identifies the deck.
#         This could be interesting for the ML, creating a new feature Deck
#  Embarked: Port of Embarkation
"""

# This dataset describes the medical records for Pima Indians and whether or not each patient will have an onset of diabetes within five years.
"""
# NOTE: Disable flag DATA_CLEANING_FLAG, this dataset is already ready to be used by ML 
ML_NAME = 'Pima'
DATABASE_URI = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'
TEST_FILE_NAME = None
COLUMNS_LABEL = ['preg', 'plas', 'pres (mm Hg)', 'skin (mm)', 'test (mu U/ml)', 'mass', 'pedi', 'age (years)', 'class']
TARGET_COLUMNS = 'class'
OUTPUT_IS_REGRESSION = False
DATE_TIME_COLUMNS = None
EXCLUDE_FROM_OULIERS = None
NON_TRANSFORMABLE_COLUMNS = None
# Use Neural Network for Machine Learning
#FLAGS = ExecutionFlags.USE_ONLY_NEURAL_NETWORK_FLAG | ExecutionFlags.DATA_TRANSFORM_FLAG | ExecutionFlags.DATA_VISUALIZATION_FLAG
# Use standard Machine Learning
FLAGS = ExecutionFlags.ALL & ~ExecutionFlags.USE_CNN_NEURAL_NETWORK_FLAG & ~ExecutionFlags.USE_NEURAL_NETWORK_FLAG & ~ExecutionFlags.USE_ONLY_NEURAL_NETWORK_FLAG \
        #& ~ExecutionFlags.DATA_TRANSFORM_FLAG \
        #& ~ExecutionFlags.DATA_STAT_SUMMURIZE_FLAG \
        #& ~ExecutionFlags.DATA_VISUALIZATION_FLAG \
        # Keep it empty for
#  preg = Number of times pregnant
#  plas = Plasma glucose concentration at 2 hours in an oral glucose tolerance test
#  pres = Diastolic blood pressure
#  skin = Triceps skin fold thickness (mm)
#  test = 2-Hour serum insulin (mu U/ml)
#  mass = Body mass index (weight in kg/(height in m)^2)
#  pedi = Diabetes pedigree function
#  age = Age (years)
#  class = Class variable (1:tested positive for diabetes, 0: tested negative for diabetes)
"""

# This dataset collects information from 100k medical appointments in Brazil and is focused on the question of whether or not patients show up for their appointment. A number of characteristics about the patient are included in each row
"""
ML_NAME = 'BrazilMedicalAppointments'
DATABASE_URI = 'https://github.com/jbrownlee/Datasets/blob/master/pima-indians-diabetes.data.csv'
TARGET_COLUMNS = 'No-show'
COLUMNS_LABEL = None
OUTPUT_IS_REGRESSION = False
COLUMNS_TO_DROP = None
DATE_TIME_COLUMNS = None
EXCLUDE_FROM_OULIERS = None
NON_TRANSFORMABLE_COLUMNS = None
"""

# This dataset describes flower classification, using TPU.
# Require ExecutionFlags.USE_CNN_NEURAL_NETWORK_FLAG
ML_NAME = 'FlowerClassification' # https://www.kaggle.com/c/tpu-getting-started
DATABASE_NAME = 'tpu-getting-started'
IMAGES_SAMPLES_NUM = 512 # Total number of images to be used for all datasets. None: Use all images
IMAGE_NUM_PIXELS = 512 # Image size in pixels
IMAGE_SIZE = [IMAGE_NUM_PIXELS, IMAGE_NUM_PIXELS] # (Height, Width) image size in pixels
IMAGE_CLASSES = [ # Classes name
                    'pink primrose', 'hard-leaved pocket orchid', 'canterbury bells', 'sweet pea',     'wild geranium',     'tiger lily',           'moon orchid',              'bird of paradise', 'monkshood',        'globe thistle',         # 00 - 09
                    'snapdragon', "colt's foot",                'king protea',      'spear thistle', 'yellow iris',       'globe-flower',         'purple coneflower',        'peruvian lily',    'balloon flower',   'giant white arum lily', # 10 - 19
                    'fire lily',        'pincushion flower',         'fritillary',       'red ginger',    'grape hyacinth',    'corn poppy',           'prince of wales feathers', 'stemless gentian', 'artichoke',        'sweet william',         # 20 - 29
                    'carnation',        'garden phlox',              'love in the mist', 'cosmos',        'alpine sea holly',  'ruby-lipped cattleya', 'cape flower',              'great masterwort', 'siam tulip',       'lenten rose',           # 30 - 39
                    'barberton daisy',  'daffodil',                  'sword lily',       'poinsettia',    'bolero deep blue',  'wallflower',           'marigold',                 'buttercup',        'daisy',            'common dandelion',      # 40 - 49
                    'petunia',          'wild pansy',                'primula',          'sunflower',     'lilac hibiscus',    'bishop of llandaff',   'gaura',                    'geranium',         'orange dahlia',    'pink-yellow dahlia',    # 50 - 59
                    'cautleya spicata', 'japanese anemone',          'black-eyed susan', 'silverbush',    'californian poppy', 'osteospermum',         'spring crocus',            'iris',             'windflower',       'tree poppy',            # 60 - 69
                    'gazania',          'azalea',                    'water lily',       'rose',          'thorn apple',       'morning glory',        'passion flower',           'lotus',            'toad lily',        'anthurium',             # 70 - 79
                    'frangipani',       'clematis',                  'hibiscus',         'columbine',     'desert-rose',       'tree mallow',          'magnolia',                 'cyclamen ',        'watercress',       'canna lily',            # 80 - 89
                    'hippeastrum ',     'bee balm',                  'pink quill',       'foxglove',      'bougainvillea',     'camellia',             'mallow',                   'mexican petunia',  'bromelia',         'blanket flower',        # 90 - 99
                    'trumpet creeper',  'blackberry lily',           'common tulip',     'wild rose'
                ]
TARGET_COLUMNS = None
OUTPUT_IS_REGRESSION = False
DATE_TIME_COLUMNS = None
EXCLUDE_FROM_OULIERS = None
NON_TRANSFORMABLE_COLUMNS = None
FLAGS = ExecutionFlags.USE_CNN_NEURAL_NETWORK_FLAG | ExecutionFlags.DATA_VISUALIZATION_FLAG # Use DL with CNN (Images)


# Bristol-Myers Squibb – Molecular Translation
"""
# Require ExecutionFlags.USE_CNN_NEURAL_NETWORK_FLAG
ML_NAME = 'BMS-MolecularTranslation' # https://www.kaggle.com/c/bms-molecular-translation
DATABASE_NAME = 'bms-molecular-translation'
COLUMNS_LABEL = ['image_id', 'class']
IMAGES_SAMPLES_NUM = None # Total number of images to be used for all datasets. None: Use all images
IMAGE_NUM_PIXELS = 512 # Image size in pixels
IMAGE_SIZE = [IMAGE_NUM_PIXELS, IMAGE_NUM_PIXELS] # (Height, Width) image size in pixels
IMAGE_CLASSES = None
TARGET_COLUMNS = None
OUTPUT_IS_REGRESSION = False
DATE_TIME_COLUMNS = None
EXCLUDE_FROM_OULIERS = None
NON_TRANSFORMABLE_COLUMNS = None
FLAGS = ExecutionFlags.USE_CNN_NEURAL_NETWORK_FLAG | ExecutionFlags.DATA_VISUALIZATION_FLAG # Use DL with CNN (Images)
"""

# Human Protein Atlas - Single Cell Classification (https://www.kaggle.com/c/hpa-single-cell-image-classification)
"""
# Require ExecutionFlags.USE_CNN_NEURAL_NETWORK_FLAG
ML_NAME = 'HumanProteinAtlas' # https://www.kaggle.com/c/hpa-single-cell-image-classification
DATABASE_NAME = 'hpa-single-cell-image-classification'
COLUMNS_LABEL = ['image_id', 'class']
IMAGES_SAMPLES_NUM = 512 # Total number of images to be used for all datasets. None: Use all images
IMAGE_NUM_PIXELS = 512 # Image size in pixels
IMAGE_SIZE = [IMAGE_NUM_PIXELS, IMAGE_NUM_PIXELS] # (Height, Width) image size in pixels
IMAGE_CLASSES = [ # Classes name
                    'Nucleoplasm', 
                    'Nuclear membrane', 
                    'Nucleoli', 
                    'Nucleoli fibrillar center', 
                    'Nuclear speckles', 
                    'Nuclear bodies', 
                    'Endoplasmic reticulum', 
                    'Golgi apparatus', 
                    'Intermediate filaments', 
                    'Actin filaments', 
                    'Microtubules', 
                    'Mitotic spindle', 
                    'Centrosome', 
                    'Plasma membrane', 
                    'Mitochondria', 
                    'Aggresome', 
                    'Cytosol', 
                    'Vesicles and punctate cytosolic patterns', 
                    'Negative'
                ]
TARGET_COLUMNS = None
OUTPUT_IS_REGRESSION = False
DATE_TIME_COLUMNS = None
EXCLUDE_FROM_OULIERS = None
NON_TRANSFORMABLE_COLUMNS = None
FLAGS = ExecutionFlags.USE_CNN_NEURAL_NETWORK_FLAG | ExecutionFlags.DATA_VISUALIZATION_FLAG #& ~ExecutionFlags.DATA_VISUALIZATION_FLAG # Use DL with CNN (Images)
# Channels:
# RED: Microtubule channels
# GREEN: Protein of interest
# BLUE: Nuclei channels
# YELLOW: Endoplasmic reticulum
"""
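
The `FLAGS` values above are built with bitwise operators, which are easy to mix up. The sketch below uses `DemoFlags`, a reduced and hypothetical copy of `ExecutionFlags` created only for this example, to show how flags are combined, removed and tested; `is_enabled` is also a hypothetical helper.

```python
from enum import IntFlag

class DemoFlags(IntFlag):
    """Reduced, hypothetical copy of ExecutionFlags used only for this example."""
    NONE = 0b000
    STAT = 0b001
    VISUALIZATION = 0b010
    NEURAL_NETWORK = 0b100
    ALL = 0b111

# Enable several steps: combine them with a bitwise OR.
flags = DemoFlags.STAT | DemoFlags.VISUALIZATION

# Disable one step from a larger set: AND with the complement.
without_nn = DemoFlags.ALL & ~DemoFlags.NEURAL_NETWORK

# Pitfall: AND-ing two distinct single-bit flags leaves no flag enabled.
empty = DemoFlags.STAT & DemoFlags.VISUALIZATION

def is_enabled(p_flags: DemoFlags, p_flag: DemoFlags) -> bool:
    """Return True when every bit of p_flag is set in p_flags."""
    return (p_flags & p_flag) == p_flag

print(is_enabled(flags, DemoFlags.VISUALIZATION))
print(is_enabled(without_nn, DemoFlags.NEURAL_NETWORK))
print(empty == DemoFlags.NONE)
```

In short: use `|` to add flags and `& ~` to remove them; a plain `&` between different flags disables everything.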

# Pre and Post processings

The functions below are pre- and post-processing hooks that are executed during the different steps of the Machine Learning/Deep Learning training and validation.

You have to complete these functions yourself, based on what you learn from the dataset descriptive statistics (point 2.a) and dataset visualization (point 2.b) steps, in order to:
- Remove useless features after loading the datasets (kaggle_post_load_datasets (point 1.c))
- Apply some feature processing right after the datasets are loaded (kaggle_post_load_datasets (point 1.c))
- Apply early feature selection (kaggle_post_load_datasets (point 1.c))
- Apply the feature processing resulting from the data analysis (kaggle_pre_features_engineering)
- Create new features
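
To give an idea of what such hooks typically contain, here is a minimal sketch of two common treatments: reducing a cabin identifier to its deck letter, and marking impossible 0 values as missing. The helper names `extract_deck` and `zeros_to_nan` and the sample rows are hypothetical; they only illustrate the idea on small DataFrames.

```python
import numpy as np
import pandas as pd

def extract_deck(p_df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the deck letter of 'Cabin'; missing cabins become 'U' (Unknown)."""
    df = p_df.copy()
    df['Cabin'] = df['Cabin'].apply(lambda v: 'U' if pd.isna(v) else v[0:1])
    return df

def zeros_to_nan(p_df: pd.DataFrame, p_columns: list) -> pd.DataFrame:
    """Mark impossible 0 values as missing so that they can be imputed later."""
    df = p_df.copy()
    df[p_columns] = df[p_columns].replace(0, np.nan)
    return df

# Hypothetical sample rows, not taken from the real datasets.
titanic = pd.DataFrame({'Cabin': ['C85', None, 'E46']})
pima = pd.DataFrame({'plas': [148, 0, 183], 'mass': [33.6, 26.6, 0.0]})

print(extract_deck(titanic)['Cabin'].tolist())
print(zeros_to_nan(pima, ['plas', 'mass']).isna().sum().to_dict())
```

Both helpers work on a copy, so the caller decides whether to keep the original DataFrame.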

In [None]:
def kaggle_pre_main() -> None:
    """
    This function is called at the beginning of the main procedure 
    E.g. Install some specific packages...
    """
    print('----------------------------- kaggle_pre_main -----------------------------')
    print('kaggle_pre_main: Done')

def kaggle_post_main() -> None:
    """
    This function is called at the termination of the main procedure 
    E.g. Uninstall some specific packages, cleanup, publishing...
    """
    print('----------------------------- kaggle_post_main -----------------------------')
    print('kaggle_post_main: Done')

def kaggle_pre_load_datasets() -> None:
    """
    This function is called by kaggle_load_datasets() just before processing starts
    E.g. Rename or reorganize the datasets...
    """
    print('----------------------------- kaggle_pre_load_datasets -----------------------------')
    print('kaggle_pre_load_datasets: Done')

def kaggle_post_load_datasets(p_train_df, p_validation_df, p_test_df) -> tuple:
    """
    This function is called by kaggle_load_datasets() just after the datasets were loaded
    """
    print('----------------------------- kaggle_post_load_datasets -----------------------------')

    # Refactor and enhance dataset content
    if ML_NAME == 'Titanic':
        print('kaggle_post_load_datasets: Keeping only the deck identifier of the Cabin feature')
        # Create a category U for Unknown and just keep the deck identifier
        p_train_df['Cabin'] = p_train_df['Cabin'].apply(lambda p_value : 'U' if pd.isna(p_value) else p_value[0:1])
        p_validation_df['Cabin'] = p_validation_df['Cabin'].apply(lambda p_value : 'U' if pd.isna(p_value) else p_value[0:1])
        p_test_df['Cabin'] = p_test_df['Cabin'].apply(lambda p_value : 'U' if pd.isna(p_value) else p_value[0:1])

    # Drop columns if any
    if COLUMNS_TO_DROP is not None:
        p_train_df.drop(COLUMNS_TO_DROP, inplace = True, axis = 1)
        p_validation_df.drop(COLUMNS_TO_DROP, inplace = True, axis = 1)
        p_test_df.drop(COLUMNS_TO_DROP, inplace = True, axis = 1)
    
    # Keep only the selected features
    if ML_NAME == 'MelbourneHousing':
        l = list(set(p_train_df.columns) - set(['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 'YearBuilt', 'Lattitude', 'Longtitude', 'Date', 'Distance', 'Car']))
        l.remove(TARGET_COLUMNS)
        if len(l) != 0:
            p_train_df.drop(l, inplace = True, axis = 1) # train_df and validation_df contain TARGET column
            p_validation_df.drop(l, inplace = True, axis = 1)
            p_test_df.drop(l, inplace = True, axis = 1) # test_df does not
        
    print('kaggle_post_load_datasets: Done')
    return p_train_df, p_validation_df, p_test_df

def kaggle_pre_load_images_datasets(p_train_url: str, 
                                    p_labels: list = None,
                                    p_global_path: str = None,
                                    p_validation_url: str = None,
                                    p_test_url: str = None,
                                    p_train_size: float = 0.9
                                    ) -> list:
    """
    This function is called by kaggle_load_images_datasets() just before processing starts
    The Images database shall be organized as follows:
    1) In Tfrecord format
        <current_folder>/tfrecords-jpeg-<n>x<n>/train
        <current_folder>/tfrecords-jpeg-<n>x<n>/val
        <current_folder>/tfrecords-jpeg-<n>x<n>/test
    2) Otherwise:
        <current_folder>/train_labels.csv
        <current_folder>/train
        [<current_folder>/train], could not exist
        <current_folder>/test
    :parameter p_labels: The labels of the columns to be used. Default: None
    :parameter p_train_size: Fraction of the data kept for the Training dataset when the Validation dataset has to be extracted from it
    """
    print('----------------------------- kaggle_pre_load_images_datasets -----------------------------')
    train_df = None
    validation_df = None
    test_df = None
    
    if not p_train_url is None: # Case of separated Images and labels
        print('kaggle_pre_load_images_datasets: Processing %s' % ML_NAME)
        path = p_global_path #os.path.join(p_global_path, p_train_url) # p_train_url = DATABASE_NAME
        if ML_NAME == 'HumanProteinAtlas':
            # Add a 'full path' feature with 4 color channels full path
            p = os.path.join(path, 'train.csv')
            train_df = pd.read_csv(p)
            # Set labels
            if not p_labels is None:
                train_df.columns = p_labels
            # Prepare Training, Validation and Test datasets
            print('----------------------------- training dataset')
            # Build full path images
            p = os.path.join(path, 'train')
            train_df['image_path'] = train_df['image_id'].apply(lambda x: p + '/%s' % x)
            train_df.drop(['image_id'], inplace = True, axis = 1)
            # Split labels into classes as described by IMAGE_CLASSES
            # The 'class' column holds '|' separated integer indexes into IMAGE_CLASSES, so fit on the indexes
            mlb = MultiLabelBinarizer()
            mlb.fit([list(range(len(IMAGE_CLASSES)))])
            print('kaggle_pre_load_images_datasets: mlb:', mlb.classes_)
            train_df['class'] = train_df['class'].apply(lambda x: np.squeeze(mlb.transform([list(map(int, x.split('|')))])).astype(np.int8))
            # Read the test files and build the files list
            print('----------------------------- kaggle_pre_load_images_datasets: test dataset')
            if not p_test_url is None:
                p = os.path.join(path, 'test')
                test_files = tf.io.gfile.glob(os.path.join(p, '*'))
                # Group files by channels (blue, green, red and yellow)
                test_files.sort()
                test_files = [(test_files[i].split('_')[0], os.path.basename(test_files[i].split('_')[0])) for i in range(0, len(test_files), 4)]
                # Put them in dataset
                test_df = pd.DataFrame(test_files, columns = ['image_path', 'id'])
            else:
                raise Exception('kaggle_pre_load_images_datasets', 'Test url not provided')
        elif  ML_NAME == 'BMS-MolecularTranslation':
            # Add a 'full path' feature
            p = os.path.join(path, 'train_labels.csv')
            train_df = pd.read_csv(p)
            # Set labels
            if not p_labels is None:
                train_df.columns = p_labels
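            # Images are stored under <current_folder>/train/<c0>/<c1>/<c2>/<image_id>.png, not under the CSV file path
            p = os.path.join(path, 'train')
            train_df['full_path'] = train_df['image_id'].apply(lambda x: p + '/%c/%c/%c/%s.png' % (x[0], x[1], x[2], x))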
            # FIXME Process test_df 
            raise Exception('kaggle_pre_load_images_datasets', 'FIXME Process test_df')
        else:
            raise Exception('kaggle_pre_load_images_datasets', 'Unsupported ML_NAME: %s' % ML_NAME)
    
        if not p_validation_url is None:
            p = os.path.join(path, p_validation_url)
            if  ML_NAME == 'BMS-MolecularTranslation':
                # Add a 'full path' feature
                p = os.path.join(path, 'validation_labels.csv')
                validation_df = pd.read_csv(p)
                # Set labels
                if not p_labels is None:
                    validation_df.columns = p_labels
                validation_df['full_path'] = validation_df['image_id'].apply(lambda x: p + '/%c/%c/%c/%s.png' % (x[0], x[1], x[2], x))
        else: # Need to randomly extract the Validation dataset from the Training dataset
            train_df, validation_df = model_selection.train_test_split(train_df, train_size = p_train_size)
            # reindex after split
            train_df.reset_index(inplace = True)
            validation_df.reset_index(inplace = True)

        #print('----------------------------- kaggle_pre_load_images_datasets: training dataset')
        #print(train_df.head())
        #print('----------------------------- kaggle_pre_load_images_datasets: validation dataset')
        #print(validation_df.head())
        #print('----------------------------- kaggle_pre_load_images_datasets: test dataset')
        #print(test_df.head())
    else:
        pass

    print('kaggle_pre_load_images_datasets: Done')
    return train_df, validation_df, test_df

def kaggle_post_load_images_datasets(p_train_df, 
                                     p_validation_df, 
                                     p_train_df_num_images: int, 
                                     p_validation_df_num_images: int
                                    ) -> list:
    print('----------------------------- kaggle_post_load_images_datasets -----------------------------')
    print('kaggle_post_load_images_datasets: ', type(p_train_df))

    print('kaggle_post_load_images_datasets: Done')
    return p_train_df, p_validation_df, p_train_df_num_images, p_validation_df_num_images

def kaggle_pre_features_engineering(p_df: pd.core.frame.DataFrame) -> pd.core.frame.DataFrame:
    """
    Apply the feature processing resulting from the data analysis, just before starting the data engineering (point 3.a).
    """
    print('----------------------------- kaggle_pre_features_engineering -----------------------------')
    print(p_df.describe().T)

    if ML_NAME == 'Pima': 
        # FEATURES_PROCESSING
        print('kaggle_pre_features_engineering: Replacing impossible 0 values by NaN value')
        # np.nan is used because the np.NaN alias was removed in NumPy 2.0
        for column in ['plas', 'pres (mm Hg)', 'skin (mm)', 'test (mu U/ml)', 'mass']:
            p_df[column] = p_df[column].replace(0, np.nan)
    elif ML_NAME == 'Titanic':
        # 'S' is the most frequent value (see kaggle_summurize_data: distribution of categorical features)
        p_df['Embarked'] = p_df['Embarked'].apply(lambda p_value : 'S' if pd.isna(p_value) else p_value[0:1])
        print('----------------------------- kaggle_pre_features_engineering: Features creation')
        p_df['FamilySize'] = p_df['SibSp'] + p_df['Parch'] + 1
        # Create classes of ages based on the common Age distribution. Note: a NaN age fails every comparison and therefore falls into 'Baby'
        p_df['AgeClass'] = p_df.apply(lambda p_row : 'Senior' if p_row['Age'] >= 60 else 'Adult' if p_row['Age'] >= 35 else 'Young Adult' if p_row['Age'] >= 25 else 'Teen' if p_row['Age'] >= 14 else 'Child' if p_row['Age'] >= 4 else 'Baby', axis = 1)
        # Create classes of fare based on the discussion below
        p_df['FareClass'] = p_df.apply(lambda p_row : 'Very Expensive' if p_row['Fare'] >= (3*512/4) else 'Expensive' if p_row['Fare'] >= (512/2) and p_row['Fare'] < (3*512/4) else 'Cheap' if p_row['Fare'] < (512/2) and p_row['Fare'] >= (512/4) else 'Very Cheap', axis = 1)
        print('----------------------------- kaggle_pre_features_engineering: Features deletion')
        # SibSp and Parch were replaced by FamilySize, Age by AgeClass and Fare by FareClass
        p_df.drop(['SibSp', 'Parch', 'Age', 'Fare'], inplace = True, axis = 1)
        print('----------------------------- kaggle_pre_features_engineering: After features creation/deletion:')
        #print(p_df.head())
        print(p_df.describe().T)

    print('kaggle_pre_features_engineering: Done')
    return p_df

def kaggle_post_features_engineering(p_df: pd.core.frame.DataFrame) -> pd.core.frame.DataFrame:
    """
    Apply the feature processing resulting from the data analysis, at the end of the data engineering (point 3.a).
    """
    print('----------------------------- kaggle_post_features_engineering -----------------------------')
    print('kaggle_post_features_engineering: Done')
    return p_df

# Loading the Datasets/Images

Before loading and examining our datasets (point c.1), we are just going to set a number of defaults, such as the settings for plotting, the Deep Learning parameters... (point b.2)

Notes:
- Point b.4 provides some helper functions to optimize Neural Network execution using TPUs or GPUs. TPU stands for Tensor Processing Unit
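
Among these defaults, fixing the random seeds is what makes two runs comparable. The sketch below is restricted to Python's `random` module and NumPy so that it stays lightweight (TensorFlow is deliberately left out); `set_seed` and `draw` are hypothetical helper names used only for this example.

```python
import os
import random
import numpy as np

def set_seed(p_seed: int = 42) -> None:
    """Seed the Python and NumPy global random generators."""
    random.seed(p_seed)
    np.random.seed(p_seed)
    # Only affects hash randomization of Python processes started afterwards.
    os.environ['PYTHONHASHSEED'] = str(p_seed)

def draw() -> tuple:
    """Draw one value from each generator."""
    return random.random(), float(np.random.rand())

set_seed(42)
first = draw()
set_seed(42)
second = draw()
print(first == second)
```

Re-seeding before each experiment gives the same sequence of random values, which helps when comparing two models on the same split.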

In [None]:
# b.2) Set some defaults
def kaggle_set_mp_default() -> None:
    """
    Some default settings for Matplotlib plots
    """
    warnings.filterwarnings("ignore") # to clean up output cells
    pd.set_option('display.precision', 3) # Full option name; the short 'precision' alias is rejected by recent pandas versions
    # Set Matplotlib defaults
    plt.rc('figure', autolayout=True)
    plt.rc('axes', labelweight='bold', labelsize='large', titleweight='bold', titlesize=18, titlepad=10)
    plt.rc('image', cmap='magma')
    # End of function set_mp_default

# Fix random values for reproducibility
SEED_HARCODED_VALUE = 42

def kaggle_set_seed(p_seed: int = SEED_HARCODED_VALUE) -> None:
    """
    Seed the random number generators for reproducibility
    :parameter p_seed: The seed value for the random functions. If None, a random seed is drawn
    """
    seed = p_seed
    if seed is None:
        seed = random.randint(0, 10000) 
    random.seed(seed) # Python built-in generator
    np.random.seed(seed) # NumPy global generator, also used by scikit-learn when random_state is None
    tf.random.set_seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    os.environ['TF_DETERMINISTIC_OPS'] = '1'
    # End of function set_seed

def kaggle_modules_versions() -> None:
    """
    Print the different modules version
    """
    print('----------------------------- modules_versions -----------------------------')
    print("Numpy version: " + np.__version__)
    print('seaborn: %s' % sns.__version__)
    print("Pandas version: " + pd.__version__)
    print("Sklearn version: " + sklearn.__version__)
    print("Tensorflow version: " + tf.__version__)
    print('modules_versions: Done')
    # End of function modules_versions
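To check what seeding buys us, here is a small sketch that leaves TensorFlow out on purpose. The helper name `draw_samples` is hypothetical and not part of the notebook. Python's `random` module and NumPy keep separate generator states, so both are seeded.

```python
import random

import numpy as np

def draw_samples(p_seed: int) -> tuple:
    """Hypothetical helper: seed both generators, then draw a few values."""
    random.seed(p_seed)     # Python built-in generator
    np.random.seed(p_seed)  # NumPy global generator
    return random.random(), np.random.rand(3).tolist()

first = draw_samples(42)
second = draw_samples(42)
other = draw_samples(43)

print('Same seed, same draws:', first == second)
print('Different seed, same draws:', first == other)
```

Re-seeding with the same value replays the same sequence, which is what makes a notebook run comparable from one execution to the next.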

When working with Neural Networks, some additional defaults and a TPU detection are required.

In [None]:
# b.4) Set additional defaults for Neural Networks
DL_BATCH_SIZE = 16 # Default batch size for DL models 
DL_EPOCH_NUM = 32 # Default epoch number for DL models 
DL_LEARNING_RATE = 0.002 # Default learning rate for DL models 
DL_DROP_RATE = 0.2 # Default drop rate for DL models 
DL_INPUT_SHAPE = None # Default input shape size for DL models, will be used for cross_validation (see kaggle_check_models)

# Doing TPU detection
def kaggle_tpu_detection():
    """
    This function detects TPU support. If no TPU is available, the default distribution strategy is returned.
    Note: This function also sets up the global path used to access datasets when a TPU (from the Kaggle environment) is detected.
    :return: The appropriate distribution strategy and the global path to access datasets (None if no TPU support)
    """
    print('----------------------------- kaggle_tpu_detection -----------------------------')
    global_path = None
    try:
        tpu = tf.distribute.cluster_resolver.TPUClusterResolver() 
        print('kaggle_tpu_detection: Running on TPU ', tpu.master())
    except ValueError:
        tpu = None
    if tpu:
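        tf.config.experimental_connect_to_cluster(tpu)
        tf.tpu.experimental.initialize_tpu_system(tpu)
        # List the TPU devices once connected to the cluster, so that they are visible
        print("kaggle_tpu_detection: List of devices: ", tf.config.list_logical_devices('TPU'))
        strategy = tf.distribute.TPUStrategy(tpu) # Non-experimental name, available since Tensorflow 2.3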
        if strategy.num_replicas_in_sync > 1:
            # Update paths accordingly
            from kaggle_datasets import KaggleDatasets
            global_path = KaggleDatasets().get_gcs_path(DATABASE_NAME)
    else:
        strategy = tf.distribute.get_strategy() 
    print('kaggle_tpu_detection: replica=%s' % str(strategy.num_replicas_in_sync))
    print('kaggle_tpu_detection: global_path: ', global_path)
    
    print('kaggle_tpu_detection Done')
    return strategy, global_path
    # End of function kaggle_tpu_detection

# Tensorflow specific helper functions
def kaggle_bytes_to_tfrecord(p_value):
    """
    Returns a bytes_list from a string/byte
    """
    if isinstance(p_value, type(tf.constant(0))):
        p_value = p_value.numpy()
    return tf.train.Feature(bytes_list = tf.train.BytesList(value = [p_value]))
    # End of function kaggle_bytes_to_tfrecord

def kaggle_floats_to_tfrecord(p_value):
    """
    Returns a float_list from a float/double
    """
    return tf.train.Feature(float_list = tf.train.FloatList(value = [p_value]))
    # End of function kaggle_floats_to_tfrecord

def kaggle_ints_to_tfrecord(p_value):
    """
    Returns an int64_list from a bool/enum/int/uint
    """
    return tf.train.Feature(int64_list = tf.train.Int64List(value = [p_value]))
    # End of function kaggle_ints_to_tfrecord

def kaggle_dataset_size(p_filenames: list) -> int:
    """
    The number of data items in the dataset is written in the name of the .tfrec files (e.g. features00-230.tfrec = 230 data items in the file features00-230.tfrec)
    """
    print('----------------------------- kaggle_dataset_size -----------------------------')
    n = [int(re.compile(r"-([0-9]*)\.").search(filename).group(1)) for filename in p_filenames]
    print('kaggle_dataset_size: n=', n)
    print('kaggle_dataset_size: np.sum(n)=', np.sum(n))
    return np.sum(n)
    # End of function kaggle_dataset_size

def kaggle_decode_png_image(p_image_data, p_channels: int = 3):
    """
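    This function decodes the PNG image and applies the required Tensorflow adjustments
    Note: The image was stored in a Tfrecord dataset
    :parameter p_image_data: The encoded PNG image, as a byte string
    :parameter p_channels: The number of color channels of the decoded image. Default: 3
    :return: The decoded image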
    """
    image = tf.image.decode_png(p_image_data, channels = p_channels)
    # Convert image to floats in [0, 1] range
    image = tf.cast(image, tf.float32) / 255.0
    # Explicit size needed for TPU
    image = tf.reshape(image, [*IMAGE_SIZE, p_channels])
    return image
    # End of function kaggle_decode_png_image

def kaggle_decode_jpeg_image(p_image_data, p_channels: int = 3):
    """
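    This function decodes the JPEG image and applies the required Tensorflow adjustments
    Note: The image was stored in a Tfrecord dataset
    :parameter p_image_data: The encoded JPEG image, as a byte string
    :parameter p_channels: The number of color channels of the decoded image. Default: 3
    :return: The decoded image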
    """
    image = tf.image.decode_jpeg(p_image_data, channels = p_channels)
    # Convert image to floats in [0, 1] range
    image = tf.cast(image, tf.float32) / 255.0
    # Explicit size needed for TPU
    image = tf.reshape(image, [*IMAGE_SIZE, p_channels])
    return image
    # End of function kaggle_decode_jpeg_image

# Tensorflow Protocol buffer for unlabeled images stored into a Tfrecord dataset.
# Labels are replaced by identifiers
unlabeled_image_feature_description = {
    'image': tf.io.FixedLenFeature([], tf.string), # tf.string means bytestring
    'id': tf.io.FixedLenFeature([], tf.string),  # shape [] means single element
}
def kaggle_read_unlabeled_tfrecord(p_data):
    data = tf.io.parse_single_example(p_data, unlabeled_image_feature_description)
    image = kaggle_decode_jpeg_image(data['image'])
    idnum = data['id']
    return image, idnum # returns a dataset of image(s)
    # End of function kaggle_read_unlabeled_tfrecord

# Tensorflow Protocol buffer for labeled images stored into a Tfrecord dataset
labeled_image_feature_description = {
    'image': tf.io.FixedLenFeature([], tf.string), # tf.string means bytestring
    'class': tf.io.FixedLenFeature([], tf.int64),  # shape [] means single element
}
def kaggle_read_labeled_tfrecord(p_data):
    data = tf.io.parse_single_example(p_data, labeled_image_feature_description)
    image = kaggle_decode_jpeg_image(data['image'])
    label = tf.cast(data['class'], tf.int32)
    return image, label # returns a dataset of (image, label) pairs
    # End of function kaggle_read_labeled_tfrecord

# FIXME Development in progress
def kaggle_build_tfrec_dataset(p_train_url: str, p_validation_url: str, p_test_url: str):
    """
    This function is called by kaggle_load_images_datasets() just before starting the processing
    The Images database shall be organized as follows:
    1) In Tfrecord format
        <current_folder>/tfrecords-jpeg-<n>x<n>/train
        <current_folder>/tfrecords-jpeg-<n>x<n>/val
        <current_folder>/tfrecords-jpeg-<n>x<n>/test
    2) Otherwise:
        <current_folder>/train_labels.csv
        <current_folder>/train
        [<current_folder>/train], could not exist
        <current_folder>/test
    """
    print('----------------------------- kaggle_build_tfrec_dataset -----------------------------')
    if not p_train_url is None and p_validation_url is None and p_test_url is None:
        if ML_NAME == 'BMS-MolecularTranslation':
            # Convert separated Images/Labels in TFrecord format
            path = os.path.join(os.path.abspath(os.getcwd()), '../input')
            path = os.path.join(path, p_train_url)
            print('kaggle_build_tfrec_dataset: path=', path)
            # Process Training dataset
            # 1. Load labels
            print('kaggle_build_tfrec_dataset: Loading train labels')
            labels_df = pd.read_csv(os.path.join(path, 'train_labels.csv'))
            labels_df['full_path'] = labels_df['image_id'].apply(lambda x: './train/%c/%c/%c/%s.png' % (x[0], x[1], x[2], x))
            print(labels_df.head())
            print(labels_df.tail())
            raise Exception('Stop')
            # 2. Create Tfrecord dataset for training
            print('kaggle_build_tfrec_dataset: Creating Tfrecord dataset for training')
            path = os.path.join(path, 'train')
            modulus = 512
            counter = 0
            i = 0
            tw = tf.io.TFRecordWriter('./%.2i-%ix%i-%i.tfrec' % (counter, IMAGE_NUM_PIXELS, IMAGE_NUM_PIXELS, modulus))
            for root, subdirs, files in os.walk(path):
                for filename in files:
                    file_path = os.path.join(root, filename)
                    # load the image
                    img = load_img(file_path, (IMAGE_NUM_PIXELS, IMAGE_NUM_PIXELS))
                    # Convert to numpy array
                    pixels = np.asarray(img) #<class 'numpy.ndarray'>
                    # Confirm pixel range is 0-255
                    #print('Data Type: %s' % pixels.dtype)
                    #print('Min: %.3f, Max: %.3f' % (pixels.min(), pixels.max()))
                    # Convert from integers to floats
                    pixels = pixels.astype('float32')
                    row = labels_df.loc[labels_df.image_id == os.path.splitext(os.path.basename(filename))[0], ['InChI']]
                    #kaggle_display_image_and_component(pixels, filename)
                    feature = {
                        'image': kaggle_bytes_to_tfrecord(pixels.tobytes()),
                        'class': kaggle_bytes_to_tfrecord(str.encode(row['InChI'].tolist()[0]))
                    }
                    record_bytes = tf.train.Example(features = tf.train.Features(feature = feature)).SerializeToString()
                    tw.write(record_bytes)
                    i += 1
                    if i % modulus == 0:
                        tw.close()
                        print('kaggle_build_tfrec_dataset: Writing TFRecord %i of %i...'%(counter, i))
                        i = 0 # Reset i
                        counter += 1
                        tw = tf.io.TFRecordWriter('./%.2i-%ix%i-%i.tfrec' % (counter, IMAGE_NUM_PIXELS, IMAGE_NUM_PIXELS, modulus))
                    # End of 'for' statement
                # End of 'for' statement
            print('kaggle_build_tfrec_dataset: counter=%d, i=%d' % (counter, i))
            tw.close()
            if (i % modulus != 0):
                os.rename(
                    './%.2i-%ix%i-%i.tfrec' % (counter, IMAGE_NUM_PIXELS, IMAGE_NUM_PIXELS, modulus),
                    './%.2i-%ix%i-%i.tfrec' % (counter, IMAGE_NUM_PIXELS, IMAGE_NUM_PIXELS, i)
                         )
            print('kaggle_build_tfrec_dataset: Tfrecord dataset for training: Done')
            # 3. Create Tfrecord dataset for test
            print('kaggle_build_tfrec_dataset: Creating Tfrecord dataset for test')

            print('kaggle_build_tfrec_dataset: Tfrecord dataset for test: Done')
            # End of 'if' statement, ML_NAME == 'BMS-MolecularTranslation'
        raise Exception('Stop')
    else:
        raise Exception('kaggle_build_tfrec_dataset', 'Wrong parameters')
    print('kaggle_build_tfrec_dataset: Done')
    # End of function kaggle_build_tfrec_dataset
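The file naming convention used by kaggle_dataset_size() and by the TFRecordWriter calls above (`<counter>-<n>x<n>-<count>.tfrec`) can be checked without TensorFlow. The sketch below assumes made-up file names. The helper `count_items` is hypothetical and, unlike kaggle_dataset_size(), it raises an explicit error when a name does not carry a count.

```python
import re

# Hypothetical file names following the '<counter>-<n>x<n>-<count>.tfrec' convention
filenames = ['00-192x192-512.tfrec', '01-192x192-512.tfrec', '02-192x192-230.tfrec']

# The item count is the run of digits between the last '-' and the '.' of the extension
COUNT_PATTERN = re.compile(r"-([0-9]+)\.")

def count_items(p_filenames: list) -> int:
    """Sum the item counts encoded in the file names."""
    total = 0
    for filename in p_filenames:
        match = COUNT_PATTERN.search(filename)
        if match is None:
            raise ValueError('No item count found in file name: %s' % filename)
        total += int(match.group(1))
    return total

print('Number of items:', count_items(filenames))
```

Reading the count from the name avoids iterating over the whole Tfrecord dataset just to know its size, which is needed to compute the number of steps per epoch.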

Now we are ready to load our dataset and examine it to understand the data it contains. The loading function accepts a URI starting with file://, http:// or https://.

When loading the dataset, you can specify or overwrite the column labels.

Based on the data analysis, you can also define some post-loading processing using lambda functions (see kaggle_post_load_datasets).

The function kaggle_load_datasets() splits the data into three datasets:
- Training dataset, used to train the model (size fixed by p_train_size, default is 90%)
- Test dataset, used to test the model with unseen data (size is (1 - p_train_size), default is 10%)
- The Training dataset is then split again into a Training dataset (80%) and a Validation dataset (20%) used to validate the model during fitting

Note: The Test dataset does not contain the target features (see TARGET_COLUMNS)
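To make the resulting proportions concrete, here is a minimal sketch of the two successive splits on a made-up 100-row dataset. The column names 'feature' and 'target' are assumptions used only for this illustration. With the default ratios, the data ends up as 72% training, 18% validation and 10% test.

```python
import numpy as np
import pandas as pd
from sklearn import model_selection

# Hypothetical dataset: 100 observations, one feature and one target
df = pd.DataFrame({'feature': np.arange(100), 'target': np.arange(100) % 2})
x_df = df.drop(['target'], axis = 1)
y_df = df['target']

# First split: 90% training, 10% test
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(
    x_df, y_df, train_size = 0.9, random_state = 42)
# Second split: the training part is split again into 80% training, 20% validation
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    X_train, Y_train, train_size = 0.8, random_state = 42)

print('train=%d, validation=%d, test=%d' % (len(X_train), len(X_validation), len(X_test)))
```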

In [None]:
# c.1) Load 'Data' dataset
def kaggle_load_datasets(p_url: str, 
                         p_labels: list = None, 
                         p_train_path: str = None, 
                         p_validation_path: str = None,
                         p_train_size: float = 0.9,
                         p_seed: int = SEED_HARCODED_VALUE
                        ) -> list:
    """
    This function loads the dataset specified by p_url or by (p_train_path, p_validation_path) in the case of a Kaggle competition.
    It also sets the column labels and applies the post-load processing of the datasets if required
    :parameter p_url: The URI of the dataset (http:// or file://)
    :parameter p_labels: The labels of the columns to be used. Default: None
    :parameter p_train_path: Kaggle specific path for the train dataset
    :parameter p_validation_path: Kaggle specific path for the validation dataset
    :parameter p_train_size: The ratio of the data kept for training, the rest being the Test dataset. Default: 0.9
    :parameter p_seed: The seed value for reproducibility
    :return: Four datasets: the Training, Validation and Test datasets, plus the outputs of the Test dataset as the fourth one. The Test dataset itself does not contain the outputs, it acts as unseen data for the model
    
    :exception: Raised if the specified link is not correct
    print('----------------------------- kaggle_load_datasets -----------------------------')
    train_df = None
    validation_df = None 
    test_df = None
    y_test_df = None
    
    kaggle_pre_load_datasets()

    if not p_train_path is None and not p_validation_path is None:
        # Kaggle competition specific
        train_df = pd.read_csv(p_train_path)
        test_df = pd.read_csv(p_validation_path) # y_test_df will be None for the Kaggle competition
        # Set labels
        if not p_labels is None:
            train_df.columns = p_labels
        # Split train_df into Training and Validation datasets
        y_train_df = train_df[TARGET_COLUMNS]
        x_train_df = train_df.drop([TARGET_COLUMNS], axis = 1)
        X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(x_train_df, y_train_df, train_size = 0.8, random_state = p_seed)               
        train_df = pd.concat([X_train, Y_train], axis = 1)
        validation_df = pd.concat([X_validation, Y_validation], axis = 1)       
    else:
        # Get the data
        df = None
        if p_url.startswith('file://'):
            df = pd.read_csv(p_url[7:])
        elif p_url.startswith('http'):
            ds = requests.get(p_url).content
            df = pd.read_csv(io.StringIO(ds.decode('utf-8')))
        if df is None:
            raise Exception('kaggle_load_datasets: Failed to load data frame', 'url=%s' % (p_url))
        # Set labels
        if not p_labels is None:
            df.columns = p_labels
        # Split them into Training, Test and Validation datasets
        y_df = df[TARGET_COLUMNS]
        x_df = df.drop([TARGET_COLUMNS], axis = 1)
        X_train, X_test, Y_train, Y_test = model_selection.train_test_split(x_df, y_df, train_size = p_train_size, random_state = p_seed)
        train_df = pd.concat([X_train, Y_train], axis = 1)
        test_df = X_test
        y_test_df = Y_test

        y_df = train_df[TARGET_COLUMNS]
        x_df = train_df.drop([TARGET_COLUMNS], axis = 1)
        X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(x_df, y_df, train_size = 0.8, random_state = p_seed)
        train_df = pd.concat([X_train, Y_train], axis = 1)
        validation_df = pd.concat([X_validation, Y_validation], axis = 1)

    #print('----------------------------- kaggle_load_datasets: training dataset')
    #print(train_df.describe().T)
    #print('----------------------------- kaggle_load_datasets: validation dataset')
    #print(validation_df.describe().T)
    #print('----------------------------- kaggle_load_datasets: test dataset')
    #print(test_df.describe().T)
    
    # Apply post processing after loading dataset
    train_df, validation_df, test_df = kaggle_post_load_datasets(train_df, validation_df, test_df)

    print('----------------------------- kaggle_load_datasets: training dataset')
    print(train_df.head())
    print('----------------------------- kaggle_load_datasets: validation dataset')
    print(validation_df.head())
    print('----------------------------- kaggle_load_datasets: test dataset')
    print(test_df.head())

    print('kaggle_load_datasets: Done: %s' % (p_url if not p_url is None else p_train_path))
    return train_df, validation_df, test_df, y_test_df
    # End of function kaggle_load_datasets
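The URI handling at the beginning of kaggle_load_datasets() could also rely on `urllib.parse` rather than on string prefixes. This is a sketch under assumptions: `read_csv_from_uri` is a hypothetical helper, and it lets pandas fetch http(s) URLs directly instead of going through `requests` as the notebook does.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname

import pandas as pd

def read_csv_from_uri(p_uri: str) -> pd.DataFrame:
    """Hypothetical helper: load a CSV file from a file://, http:// or https:// URI."""
    parsed = urlparse(p_uri)
    if parsed.scheme == 'file':
        # Convert the URI path back into a local file system path
        return pd.read_csv(url2pathname(parsed.path))
    if parsed.scheme in ('http', 'https'):
        return pd.read_csv(p_uri)
    raise ValueError('Unsupported URI scheme: %r' % parsed.scheme)

# Usage with a temporary local file
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / 'sample.csv'
    csv_path.write_text('a,b\n1,2\n3,4\n')
    sample_df = read_csv_from_uri(csv_path.as_uri())
print(sample_df)
```

Rejecting unknown schemes early gives a clearer error message than failing later on an undefined data frame.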

The function below loads an images dataset. The images shall be stored in one of the following formats:
- Separated images/labels, with one of the following layouts:
1. One folder for each of the Training, Validation and Test datasets. Each folder contains the images in PNG or JPEG format and the labels in a CSV file
2. TODO...
- TensorFlow 'tfrec' format, with or without labels defined.

In [None]:
# c.2) Load 'Images' dataset
def kaggle_load_images_datasets(p_train_url: str, 
                                p_labels: list = None,
                                p_global_path: str = None,
                                p_validation_url: str = None,
                                p_test_url: str = None, 
                                p_train_path: str = None, 
                                p_validation_path: str = None,
                                p_test_path: str = None,
                                p_train_size: float = 0.9,
                                p_ordered: bool = False,
                                p_seed: int = SEED_HARCODED_VALUE
                                ) -> list:
    """
    This function loads Tensorflow Tfrecord datasets.
    
    :parameters p_labels: The label of the columns to be used. Default: None
    
    
    :parameter p_train_path: Path of the Training 'Tfrec' folder  
    :parameter p_validation_path: Path of the Validation 'Tfrec' folder
    :parameter p_test_path: Path of the Test 'Tfrec' folder
    :parameter p_train_size: Size ratio of Training dataset vs. Validation dataset
    :parameter p_seed: The seed value for reproducibility
    :return: The Training, Validation and Test datasets, followed by the number of images in each of them. The Test dataset does not contain the outputs, it acts as unseen data for the model
    """
    print('----------------------------- kaggle_load_images_datasets -----------------------------')

    train_df, validation_df, test_df = kaggle_pre_load_images_datasets(p_train_url = p_train_url, 
                                                                       p_labels = p_labels, 
                                                                       p_global_path = p_global_path,
                                                                       p_validation_url = p_validation_url, 
                                                                       p_test_url = p_test_url,
                                                                       p_train_size = p_train_size
                                                                       )
    #print('----------------------------- kaggle_load_images_datasets: training dataset')
    #print(train_df.head())
    #print('----------------------------- kaggle_load_images_datasets: validation dataset')
    #print(validation_df.head())
    #print('----------------------------- kaggle_load_images_datasets: test dataset')
    #print(test_df.head())

    train_df_num_images = None
    validation_df_num_images = None
    test_df_num_images = None
    if not train_df is None:
        print('kaggle_load_images_datasets: Processing compete %s:' % ML_NAME)
        if ML_NAME == 'HumanProteinAtlas':
            def _rebuild_image_from_channels(p_image_path: str, p_label: list = None) -> list:
                """
                This function convert image channels into one RGB image
                """
                red = tf.io.read_file(p_image_path + "_red.png")
                blue = tf.io.read_file(p_image_path + "_blue.png")
                green = tf.io.read_file(p_image_path + "_green.png")
                yellow = tf.io.read_file(p_image_path + "_yellow.png")

                red = tf.io.decode_png(red, channels = 1) # Grayscale image
                blue = tf.io.decode_png(blue, channels = 1) # Grayscale image
                green = tf.io.decode_png(green, channels = 1) # Grayscale image
                yellow = tf.io.decode_png(yellow, channels = 1) # Grayscale image
                
                red = tf.math.maximum(red, yellow)
                blue = tf.math.maximum(blue, yellow)

                # Convert image to floats in [0, 1] range
                red = tf.cast(red, tf.float32) / 255.0
                blue = tf.cast(blue, tf.float32) / 255.0
                green = tf.cast(green, tf.float32) / 255.0
                
                # Explicit size needed for TPU
                red = tf.image.resize(red, [*IMAGE_SIZE])
                blue = tf.image.resize(blue, [*IMAGE_SIZE])
                green = tf.image.resize(green, [*IMAGE_SIZE])

                # Build RGB image, channels last: shape is (height, width, channel)
                # In this case, shape is (1024, 1024, 3):
                #     1024 entries of (1024 lines per 3 columns)
                #         First entry, first line:   [ R[0,0], G[0,0], B[0,0] ]
                #         First entry, second line:  [ R[0,1], G[0,1], B[0,1] ]
                #         ...
                #         Second entry, first line:  [ R[1,0], G[1,0], B[1,0] ]
                #         ....
                # Stack structure is: each entry of the stack is the tuple (R, G, B) of one pixel
                # Axis value: -1 stacks along a new last axis (channels last), 0 would give channels first
                # From stack, to extract one channel (R, G or B): stack[:, :, n], n = 0 for R, 1 for G and 2 for B
                image = tf.stack([red, green, blue], axis = -1) # RGB channels last
                image = tf.squeeze(image)

                image = tf.image.convert_image_dtype(image, tf.float32)
                
                if p_label.dtype != 'string':
                    label = tf.convert_to_tensor(p_label, dtype = tf.int8)
                else:
                    label = tf.convert_to_tensor(p_label, dtype = tf.string)
                return image, label
            # End of _rebuild_image_from_channels

            # Sample elements from the Training and Validation datasets to reduce execution time
            if not IMAGES_SAMPLES_NUM is None: # Use a subset of images from the training Dataset
                # Sample rows from the Training dataset (random.sample() does not work on a DataFrame)
                train_df = train_df.sample(n = IMAGES_SAMPLES_NUM, random_state = p_seed)
                train_df.reset_index(inplace = True)
                # Sample rows from the Validation dataset
                validation_df = validation_df.sample(n = int((1 - p_train_size) * IMAGES_SAMPLES_NUM // p_train_size), random_state = p_seed)
                validation_df.reset_index(inplace = True)
            else: # Use all images from the training Dataset
                # Nothing to do
                pass
            # Set sizes
            train_df_num_images = train_df.shape[0]
            validation_df_num_images = validation_df.shape[0]
            test_df_num_images = test_df.shape[0]
            # Build Tensorflow dataset for Training
            image_path = train_df['image_path'] # List of image paths
            labels = np.array(train_df['class'].values.tolist()).astype(np.int8) # List of labels
            train_df = tf.data.Dataset.from_tensor_slices((image_path, labels))
            # Shuffle the training dataset
            train_df = train_df.shuffle(len(image_path) + 1)  # buffer_size >= dataset length
            train_df = train_df.map(_rebuild_image_from_channels, num_parallel_calls = tf.data.AUTOTUNE)
            # Build Tensorflow dataset for Validation
            image_path = validation_df['image_path'] # List of image paths
            labels = np.array(validation_df['class'].values.tolist()).astype(np.int8) # List of labels
            validation_df = tf.data.Dataset.from_tensor_slices((image_path, labels))
            # Shuffle the validation dataset
            validation_df = validation_df.shuffle(len(image_path) + 1)  # buffer_size >= dataset length
            validation_df = validation_df.map(_rebuild_image_from_channels, num_parallel_calls = tf.data.AUTOTUNE)
            # Build Tensorflow dataset for Test
            image_path = test_df['image_path'] # List of image paths
            ids = np.array(test_df['id'].values.tolist()) # List of IDs
            test_df = tf.data.Dataset.from_tensor_slices((image_path, ids))
            # Shuffle the test dataset
            test_df = test_df.shuffle(len(image_path) + 1)  # buffer_size >= dataset length
            test_df = test_df.map(_rebuild_image_from_channels, num_parallel_calls = tf.data.AUTOTUNE)
        #elif ML_NAME == 'BMS-MolecularTranslation':
        #    pass
        else:
            raise Exception('kaggle_load_images_datasets', 'Unsupported ML_NAME: %s' % ML_NAME)
    elif ML_NAME == 'FlowerClassification':
        print('kaggle_load_images_datasets: Processing compete %s:' % ML_NAME)
        # Load file names for Train, Validation and Test images folders
        train_files = tf.io.gfile.glob(os.path.join(p_global_path, p_train_path))
        # Sample elements from Tests and Validation dataset to reduce execution time
        # FIXME How to sample Tensorflow datasets?
        #if not IMAGES_SAMPLES_NUM is None: # Use a subset of images from the training Dataset
        #    # Shuffle row from the Training dataset
        #    train_files = random.sample(train_files, IMAGES_SAMPLES_NUM)
        #    train_files.reset_index(inplace = True);
        #else: # Use all images from the training Dataset
        #    # Nothing to do
        #    pass
        # Disabling order increases speed
        ignore_order = tf.data.Options()
        if not p_ordered:
            ignore_order.experimental_deterministic = False # disable order, increase speed
        # Build the dataset with the images
        train_df = tf.data.TFRecordDataset(train_files, num_parallel_reads = tf.data.experimental.AUTOTUNE)
        # To use data as soon as it streams in
        train_df = train_df.with_options(ignore_order)
        # Decode tfrecord images into (jpeg, label)
        train_df = train_df.map(kaggle_read_labeled_tfrecord, num_parallel_calls = tf.data.experimental.AUTOTUNE)
        print('kaggle_load_images_datasets: Training data shapes:', train_df.cardinality().numpy())
        for image, label in train_df.take(3):
            print('kaggle_load_images_datasets: Training data label examples:', label.numpy())
            print(image.numpy().shape, len(image.numpy()), label.numpy().shape)
        train_df_num_images = kaggle_dataset_size(train_files)
        if not p_validation_path is None:
            validation_files = tf.io.gfile.glob(os.path.join(p_global_path, p_validation_path))
            # Sample elements from Tests and Validation dataset to reduce execution time
            # FIXME How to sample Tensorflow datasets?
            #if not IMAGES_SAMPLES_NUM is None: # Use a subset of images from the training Dataset
            #    # Shuffle row from the Validation dataset
            #    validation_df = random.sample(validation_df, int((1 - p_train_size) * IMAGES_SAMPLES_NUM // p_train_size))
            #    validation_df.reset_index(inplace = True);
            #else: # Use all images from the validation Dataset
            #    # Nothing to do
            #    pass
            # Build the dataset with the images
            validation_df = tf.data.TFRecordDataset(validation_files, num_parallel_reads = tf.data.experimental.AUTOTUNE)
            # To use data as soon as it streams in
            validation_df = validation_df.with_options(ignore_order)
            # Decode tfrecord images into (jpeg, label)
            validation_df = validation_df.map(kaggle_read_labeled_tfrecord, num_parallel_calls = tf.data.experimental.AUTOTUNE)
            print('kaggle_load_images_datasets: Validation data shapes:', validation_df.cardinality().numpy())
            for image, label in validation_df.take(3):
                print('kaggle_load_images_datasets: Validation data label examples:', label.numpy())
                print(image.numpy().shape, len(image.numpy()), label.numpy().shape)
            validation_df_num_images = kaggle_dataset_size(validation_files)
        if not p_test_path is None:
            test_files = tf.io.gfile.glob(os.path.join(p_global_path, p_test_path))
            print('----------------------------- kaggle_load_images_datasets: test_files')
            # Build the dataset with the images
            test_df = tf.data.TFRecordDataset(test_files, num_parallel_reads = tf.data.experimental.AUTOTUNE)
            # To use data as soon as it streams in
            test_df = test_df.with_options(ignore_order)
            # Decode tfrecord images into (jpeg, id)
            test_df = test_df.map(kaggle_read_unlabeled_tfrecord, num_parallel_calls = tf.data.experimental.AUTOTUNE)
            print('kaggle_load_images_datasets: Test data shapes:', test_df.cardinality().numpy())
            test_df_num_images = kaggle_dataset_size(test_files)

    train_df, validation_df, train_df_num_images, validation_df_num_images = kaggle_post_load_images_datasets(train_df, validation_df, train_df_num_images, validation_df_num_images)

    print('kaggle_load_images_datasets: Done: %s' % (p_train_url if not p_train_url is None else p_train_path))
    return train_df, validation_df, test_df, train_df_num_images, validation_df_num_images, test_df_num_images
    # End of function kaggle_load_images_datasets
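The channel merging done in _rebuild_image_from_channels() can be reproduced with NumPy to check the resulting shape. This is only an illustration on a tiny random image: the 4x6 size and the random pixel values are assumptions, and the real function works on Tensorflow tensors read from PNG files.

```python
import numpy as np

rng = np.random.default_rng(0)
height, width = 4, 6  # Tiny hypothetical image size

# Four grayscale channels with shape (height, width, 1), as returned by a PNG decoder with channels = 1
red, green, blue, yellow = (
    rng.integers(0, 256, size = (height, width, 1), dtype = np.uint8) for _ in range(4)
)

# Merge the yellow channel into red and blue, as done in the notebook
red = np.maximum(red, yellow)
blue = np.maximum(blue, yellow)

# Convert to floats in [0, 1] range, stack along a new last axis, then drop the axis of size 1
channels = [c.astype(np.float32) / 255.0 for c in (red, green, blue)]
image = np.squeeze(np.stack(channels, axis = -1))

print(image.shape)     # (height, width, 3): channels last
print(image[0, 0, :])  # The (R, G, B) tuple of the first pixel
```

Stacking gives an intermediate shape of (height, width, 1, 3), which is why the squeeze is needed before the image can be used as a regular RGB image.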

# Learning from data

Examining the dataset means getting a global overview of its data from a statistical point of view, using:
1. Some basic statistical tools such as means, standard deviations, quartiles, skewness and correlation (point 2.a)
2. Some visualization tools such as histograms, density plots (point 2.b.1) or image display (point 2.b.2)

Some additional helper functions are also provided.


Understanding the data is the most important step. The kaggle_summurize_data() function provides a lot of information to help you in this task:
- Dataset info: It provides information about the structure of the data:
  1) The number of features (or attributes or columns), and the name (or label) of each. Here, it is important to understand what each feature means, which values it can take, which units it uses... This requires a lot of research work to understand our problem,
  2) The type of each feature. The 'object' type indicates a categorical feature, which means it will have to be encoded as numbers,
  3) One or several of these features will be our ML output, and some of them could be removed later because they bring little to solve our problem (e.g. strongly correlated features, dimensionality reduction using PCA...),
  4) The number of observations (or samples) in the dataset. This will be useful to split our dataset into training, validation and test datasets.
- Dataset columns labels: It indicates the name (or label) of each attribute
- Means: It provides the mean value of each feature (also provided by the statistical abstract, see below)
- Dataset statistical abstract: It provides, for each feature, basic statistical metrics such as means, standard deviations, quartiles...
- Dataset Head: It displays the first samples of the dataset. It gives some indication of the values of each observation. Note that it is not sufficient to detect specific values such as NULL or NaN values, zeros, string values, categorical values...
- Unique possible columns: It provides, for each feature, the number of unique values. This will help you during the data transformation to rescale and center the feature values (see point 3.c). Very often, a feature with few unique values (e.g. 2 or 3) is also a categorical feature,
- Correlation table: It provides the correlation between all pairs of features and the list of the pairs whose correlation is above 0.7 or below -0.7. This will be used to reduce the number of features when some of them are strongly linked (see the p_correlation_threshold parameter)
- Skewness Reduction: It provides, for each feature, indicators about [Skewness and Kurtosis](https://codeburst.io/2-important-statistics-terms-you-need-to-know-in-data-science-skewness-and-kurtosis-388fef94eeaa)
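
To make the last two items more concrete, here is a minimal sketch that computes the skewness, the kurtosis and the strongly correlated pairs on a toy DataFrame. The column names and the values are invented for the illustration; only the 0.7 threshold comes from the default value of p_correlation_threshold.

```python
import numpy as np
import pandas as pd

# Toy dataset (hypothetical columns and values, for illustration only)
rng = np.random.default_rng(42)
area = rng.uniform(50, 200, size = 200)
toy_df = pd.DataFrame({
    'area': area,
    'rooms': np.round(area / 25 + rng.normal(0, 0.5, size = 200)), # Strongly linked to 'area'
    'age': rng.exponential(20, size = 200),                        # Right-skewed feature
})

# Skewness and kurtosis of each numerical feature
print(toy_df.skew())
print(toy_df.kurt())

# Pairs of distinct features whose absolute Pearson correlation exceeds the threshold
threshold = 0.7
corr = toy_df.corr(method = 'pearson').unstack().reset_index()
corr.columns = ['var1', 'var2', 'corr']
# 'var1' < 'var2' keeps each pair once and drops the diagonal
strong = corr[(corr['corr'].abs() > threshold) & (corr['var1'] < corr['var2'])]
print(strong)
```

In this toy example, a positive skewness is expected for 'age', which is the kind of feature a later transformation (e.g. a logarithm) could make more symmetric.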

Note: Here we use pandas_profiling to generate an analysis report in HTML format. This report is highly valuable because of the information it provides for each column:
1. Specific value indicators such as zeros, NaN...
2. Distinct values
3. Statistical values such as mean, min/max...
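
When generating the full report takes too long, the same three kinds of indicators can be approximated with plain pandas. The sketch below is only an illustration on an invented DataFrame (the column names are hypothetical); it is not a replacement for the profiling report.

```python
import numpy as np
import pandas as pd

# Invented data, for illustration only
toy_df = pd.DataFrame({
    'glucose': [148, 0, 183, np.nan, 137],
    'pressure': [72, 66, 0, 0, 40],
    'outcome': ['yes', 'no', 'yes', 'no', 'yes'],
})

numeric = toy_df.select_dtypes(include = 'number')
report = pd.DataFrame({
    'nan': toy_df.isnull().sum(),     # Missing values per column
    'zeros': (numeric == 0).sum(),    # Zeros, numerical columns only
    'distinct': toy_df.nunique(),     # Distinct values (NaN excluded)
    'mean': numeric.mean(),
    'min': numeric.min(),
    'max': numeric.max(),
})
print(report)
```

Columns that are not numerical simply get NaN for the indicators that do not apply to them.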

In [None]:
# 2. Summarize the dataset content in a statistical form
# a) Descriptive statistics
def kaggle_summurize_data(p_df: pd.core.frame.DataFrame, p_correlation_threshold: float = 0.7) -> None:
    """
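    This function provides a statistical view of the current dataset
    :parameters p_df: The dataset handle
    :parameters p_correlation_threshold: The absolute correlation value above which a pair of features is reported. Default: 0.7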
    """
    print('----------------------------- kaggle_summurize_data -----------------------------')
    # General information
    print('Dataset info:')
    print(p_df.info())
    print('----------------------------- kaggle_summurize_data: Dataset columns labels:')
    print(p_df.columns)
    print('----------------------------- kaggle_summurize_data: Means:')
    print(p_df.mean(numeric_only = True)) # Numerical columns only: text columns have no mean
    print('----------------------------- kaggle_summurize_data: Dataset statistical abstract:')
    print(p_df.describe().T)
    print('----------------------------- kaggle_summurize_data: Dataset Head:')
    print(p_df.head(20))
    # NaN values
    print('----------------------------- kaggle_summurize_data: NaN values distribution:')
    print(p_df.isnull().sum().sort_values(ascending = False))
    print("----------------------------- kaggle_summurize_data: Number of rows with NaN: ", p_df.isnull().any(axis = 1).sum())
    # Zeros per columns
    print('----------------------------- kaggle_summurize_data: Zeros per columns distribution:')
    for column in p_df.columns:
        if p_df[column].dtype == 'int64' or p_df[column].dtype == 'float64':
            zeros = p_df[column].isin([0]).sum()
            # Report the number of zeros and the share of rows they represent
            print('{}: {} ({:.2f}%)'.format(column, zeros, 100 * zeros / max(len(p_df), 1)))
        else:
            print('%s: Not numerical column' % column)
    # End of 'for' statement
    # Distribution of categorical features
    print('----------------------------- kaggle_summurize_data: Distribution of categorical features:')
    categorical_columns = [col for col in p_df.columns if p_df[col].dtype == 'object']
    for c in categorical_columns:
        print('Distribution  for %s' % c)
        print(p_df[c].describe())
    # End of 'for' statement
    # Distribution of numerical features
    print('----------------------------- kaggle_summurize_data: Distribution of numerical features:')
    numerical_columns = [col for col in p_df.columns if p_df[col].dtype == 'int64' or p_df[col].dtype == 'float64']
    for c in numerical_columns:
        print('Distribution  for %s' % c)
        print(p_df[c].describe())
    # End of 'for' statement
    #  Calculate skew and kurt 
    print('----------------------------- kaggle_summurize_data: Skewness Reduction')
    for c in numerical_columns:
        print('Skew of %s: %f' % (c, p_df[c].skew()))
        print('Kurt of %s: %f' % (c, p_df[c].kurt()))
    # End of 'for' statement
    #  Unique possible columns
    print('----------------------------- kaggle_summurize_data: Unique possible columns:')
    for c in p_df.columns:
        print('{}: {}'.format(c, len(p_df[c].unique())))
    # End of 'for' statement
    # Build Correlation matrix (numerical columns only: text columns cannot be correlated)
    print('----------------------------- kaggle_summurize_data: Correlation table:')
    corr_matrix = p_df.select_dtypes(include = 'number').corr(method = 'pearson')
    print(corr_matrix)
    # Extract the correlations above the threshold in absolute value
    print('----------------------------- kaggle_summurize_data: Correlations > %f or < -%f:' % (p_correlation_threshold, p_correlation_threshold))
    corr = corr_matrix.unstack().reset_index() # Group together pairwise
    corr.columns = ['var1', 'var2', 'corr'] # Rename columns to something readable
    print(corr[ (corr['corr'].abs() > p_correlation_threshold) & (corr['var1'] != corr['var2']) ] )
    # Finally, create Pandas Profiling
    #print('----------------------------- kaggle_summurize_data: Pandas Profiling:')
    #file = ProfileReport(p_df) # Disabled: takes too much time
    #file.to_file('./eda.html')
    #file.to_notebook_iframe()
    print('kaggle_summurize_data: Done')
    # End of function kaggle_summurize_data

The functions below are helpers for data visualization (see 'b.1) Data visualizations' for more details).

In [None]:
def create_grid(p_df: pd.core.frame.DataFrame, p_features:list = None, p_nun_plot_per_line:int = 3) -> tuple:
    """
    Create the grid in preparation of the plots
    :parameters p_df: The dataset handle
    :parameters p_features: The features to consider for the plot. Default: None, all the features are considered
    :parameters p_nun_plot_per_line: The number of plots per line. Default: 3
    :return: A tuple (figure, grid specification, dictionary of axes keyed 'ax0', 'ax1'...)
    """
    # Create figure
    if p_features is None:
        features = p_df.columns
    else:
        features = p_features
    sns.set_style('darkgrid')
    l = len(features) // p_nun_plot_per_line + (1 if len(features) % p_nun_plot_per_line != 0 else 0) # Number of rows (ceiling division)
    fig = plt.figure(figsize = (p_nun_plot_per_line * 5, l * 5)) # figsize is (width, height): columns first, then rows
    gs = fig.add_gridspec(l, p_nun_plot_per_line)
    gs.update(wspace = 0.1, hspace = 0.4)
    background_color = '#fbfbfb'
    # Prepare the grid
    fig_desc = dict()
    run_no = 0
    for row in range(0, l):
        for col in range(0, p_nun_plot_per_line):
            fig_desc['ax' + str(run_no)] = fig.add_subplot(gs[row, col])
            fig_desc['ax' + str(run_no)].set_facecolor(background_color)
            fig_desc['ax' + str(run_no)].tick_params(axis = 'y', left = True)
            fig_desc['ax' + str(run_no)].get_yaxis().set_visible(True)
            for s in ['top', 'right']:
                fig_desc['ax' + str(run_no)].spines[s].set_visible(False)
            run_no += 1
        # End of 'for' statement
    # End of 'for' statement
    
    return (fig, gs, fig_desc)
    # End of function create_grid

def finalize_grid(p_figure_desc: tuple, p_df: pd.core.frame.DataFrame, p_features:list = None, p_title:str = None, p_comment:str = None) -> None:
    """
    Finalize the grid after the plot
    :parameters p_figure_desc: The tuple (figure, grid specification, axes dictionary) returned by create_grid
    :parameters p_df: The dataset handle
    :parameters p_features: The features to consider for the plot. Default: None, all the features are considered
    :parameters p_title: The title of the figure
    :parameters p_comment: An additional comment to add to the figure
    """
    if p_features is None:
        features = p_df.columns
    else:
        features = p_features
    fig, gs, fig_desc = p_figure_desc
    # Add Titles & Comments (skipped when not provided)
    if not p_title is None:
        fig_desc['ax0'].text(-0.2, 0.4, p_title, fontsize = 20, fontweight = 'bold', fontfamily = 'serif')
    if not p_comment is None:
        fig_desc['ax0'].text(-0.2, 0.3, p_comment, fontsize = 13, fontweight = 'light', fontfamily = 'serif')
    # Cleanup unused plots
    for t in range(len(features), len(fig_desc)):
        for s in ['top', 'bottom', 'right', 'left']:
            fig_desc['ax' + str(t)].spines[s].set_visible(False)
        fig_desc['ax' + str(t)].tick_params(axis='x', bottom = False)
        fig_desc['ax' + str(t)].get_xaxis().set_visible(False)
    # End of 'for' statement

    plt.show()

    fig = None
    gs = None
    fig_desc = None
    # End of function finalize_grid

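def show_counts(p_df: pd.core.frame.DataFrame, p_features:list = None, p_hue:str = None, p_title:str = None, p_comment:str = None) -> None:
    """
    This function plots, for each feature, the number of observations per value
    :parameters p_df: The dataset handle
    :parameters p_features: The features to consider for the plot. Default: None, all the features are considered
    :parameters p_hue: The name of the column used to color the bars. Default: None
    :parameters p_title: The title of the figure
    :parameters p_comment: An additional comment to add to the figure
    """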
    print('----------------------------- show_counts -----------------------------')
    if p_features is None:
        features = p_df.columns
    else:
        features = p_features
    # Create the grid
    fig, gs, fig_desc = create_grid(p_df[features])
    # Draw plots
    run_no = 0
    for feature in features:
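        # Use keyword arguments: recent seaborn versions no longer accept the data as a positional x
        sns.countplot(x = feature, hue = p_hue, data = p_df, ax = fig_desc['ax' + str(run_no)], color='#ffd100')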
        run_no += 1
    # End of 'for' statement
    # Finalyze the figure
    finalize_grid((fig, gs, fig_desc), p_df, features, p_title, p_comment)
    # End of function show_counts

def show_modes(p_df: pd.core.frame.DataFrame, p_features:list = None, p_title:str = None, p_comment:str = None) -> None:
    print('----------------------------- show_modes -----------------------------')
    if p_features is None:
        features = p_df.columns
    else:
        features = p_features
    # Create the grid
    fig, gs, fig_desc = create_grid(p_df[features])
    # Draw plots
    run_no = 0
    for feature in features:
        try:
            # kdeplot replaces the deprecated distplot(hist = False)
            sns.kdeplot(x = p_df[feature], ax = fig_desc['ax' + str(run_no)], color='#ffd100')
        except (RuntimeError, ValueError, TypeError):
            # Density estimation failed (e.g. non-numerical feature): leave this subplot empty
            pass
        run_no += 1
    # End of 'for' statement
    # Finalyze the figure
    finalize_grid((fig, gs, fig_desc), p_df, features, p_title, p_comment)
    # End of function show_modes

def show_distributions(p_df: pd.core.frame.DataFrame, p_features:list = None, p_title:str = None, p_comment:str = None) -> None:
    print('----------------------------- show_distributions -----------------------------')
    if p_features is None:
        features = p_df.columns
    else:
        features = p_features
    # Create the grid
    fig, gs, fig_desc = create_grid(p_df[features])
    # Draw plots
    run_no = 0
    for feature in features:
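        try:
            # histplot(kde = True) replaces the deprecated distplot
            sns.histplot(x = p_df[feature], kde = True, stat = 'density', ax = fig_desc['ax' + str(run_no)], color='#ffd100')
        except (RuntimeError, ValueError):
            # Density estimation failed: fall back to the histogram alone
            sns.histplot(x = p_df[feature], kde = False, stat = 'density', ax = fig_desc['ax' + str(run_no)], color='#ffd100')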
        run_no += 1
    # End of 'for' statement
    # Finalyze the figure
    finalize_grid((fig, gs, fig_desc), p_df, features, p_title, p_comment)
    # End of function show_distributions

def show_scatter_plots(p_df: pd.core.frame.DataFrame, p_features:list = None, p_title:str = None, p_comment:str = None) -> None:
    print('----------------------------- show_scatter_plots -----------------------------')
    if p_features is None:
        features = p_df.columns
    else:
        features = p_features
    # Create the grid
    fig, gs, fig_desc = create_grid(p_df[features])
    # Draw plots
    run_no = 0
    for feature in features:
        # catplot is a figure-level function and ignores 'ax': use the axes-level equivalents instead
        if p_df[feature].dtype == 'object':
            sns.countplot(x = feature, data = p_df, ax = fig_desc['ax' + str(run_no)], color='#ffd100')
        else:
            sns.swarmplot(x = feature, data = p_df, ax = fig_desc['ax' + str(run_no)], color='#ffd100')
        run_no += 1
    # End of 'for' statement
    # Finalyze the figure
    finalize_grid((fig, gs, fig_desc), p_df, features, p_title, p_comment)
    # End of function show_scatter_plots

def show_trends(p_df: pd.core.frame.DataFrame, p_features:list = None, p_title:str = None, p_comment:str = None) -> None:
    print('----------------------------- show_trends -----------------------------')
    if p_features is None:
        features = p_df.columns
    else:
        features = p_features
    # Create the grid
    fig, gs, fig_desc = create_grid(p_df[features])
    # Draw plots
    run_no = 0
    for feature in features:
        sns.lineplot(data = p_df[feature], ax = fig_desc['ax' + str(run_no)], color='#ffd100')
        run_no += 1
    # End of 'for' statement
    # Finalyze the figure
    finalize_grid((fig, gs, fig_desc), p_df, features, p_title, p_comment)
    # End of function show_trends

def show_correlations(p_df: pd.core.frame.DataFrame, p_title:str = None, p_comment:str = None) -> None:
    print('----------------------------- show_correlations -----------------------------')
    # Create the grid
    fig = plt.figure(figsize = (10, 10))
    gs = fig.add_gridspec(1, 1)
    background_color = "#fbfbfb"
    # Prepare the grid
    fig_desc = dict()
    fig_desc['ax0'] = fig.add_subplot(gs[0, 0])
    fig_desc['ax0'].set_facecolor(background_color)
    fig_desc['ax0'].tick_params(axis = 'y', left=False)
    fig_desc['ax0'].get_yaxis().set_visible(False)
    for s in ["top", "right", "left"]:
        fig_desc['ax0'].spines[s].set_visible(False)
    # Draw plots
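    # Correlate numerical columns only and draw explicitly on the prepared axes
    sns.heatmap(data = p_df.select_dtypes(include = 'number').corr(), annot = True, ax = fig_desc['ax0'])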
    # Finalyze the figure
    # Add Titles & Comments
    if not p_title is None:
        fig_desc['ax0'].text(-0.2, 0.4, p_title, fontsize = 20, fontweight = 'bold', fontfamily = 'serif')
    if not p_comment is None:
        fig_desc['ax0'].text(-0.2, 0.3, p_comment, fontsize = 13, fontweight = 'light', fontfamily = 'serif')   
    plt.show()
    # End of function show_correlations

def show_outliers(p_df: pd.core.frame.DataFrame, p_title:str = None, p_comment:str = None) -> None:
    print('----------------------------- show_outliers -----------------------------')
    features = p_df.columns
    # Create the grid
    fig, gs, fig_desc = create_grid(p_df[features])
    # Draw plots
    run_no = 0
    for feature in features:
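        if p_df[feature].dtype == 'int64' or p_df[feature].dtype == 'float64':
            ds = p_df[feature] # Numerical feature: plot the values themselves to expose outliers
        else:
            ds = p_df[feature].value_counts() # Other features: fall back to the frequency of each value
        sns.boxplot(x = ds, ax = fig_desc['ax' + str(run_no)], color='#ffd100')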
        run_no += 1
    # End of 'for' statement
    # Finalyze the figure
    finalize_grid((fig, gs, fig_desc), p_df, features, p_title, p_comment)
    # End of function show_outliers

def show_features_vs_target(p_df: pd.core.frame.DataFrame, p_target:str, p_features:list = None, p_title:str = None, p_comment:str = None) -> None:
    print('----------------------------- show_features_vs_target -----------------------------')
    if p_features is None:
        features = p_df.columns.tolist() # Using tolist() for removing p_target
    else:
        features = list(p_features) # Work on a copy: the caller's list must not be modified
    if p_target in features:
        features.remove(p_target)
    # Draw plots
    for feature in features:
        sns.relplot(x = p_target, y = feature, data = p_df, col = p_target, color='#ffd100')
        # End of 'for' statement
    # End of function show_features_vs_target

def show_pair_plot_vs_target(p_df: pd.core.frame.DataFrame, p_target:str, p_title:str = None, p_comment:str = None):
    print('----------------------------- show_pair_plot_vs_target -----------------------------')
    # Create the grid
    fig = plt.figure(figsize = (12, 12))
    gs = fig.add_gridspec(1, 1)
    background_color = "#fbfbfb"
    # Prepare the grid
    fig_desc = dict()
    fig_desc['ax0'] = fig.add_subplot(gs[0, 0])
    fig_desc['ax0'].set_facecolor(background_color)
    fig_desc['ax0'].tick_params(axis = 'y', left=False)
    fig_desc['ax0'].get_yaxis().set_visible(False)
    for s in ["top", "right", "left"]:
        fig_desc['ax0'].spines[s].set_visible(False)
    # Draw plots
    sns.pairplot(p_df, hue = p_target)
    # Finalyze the figure
    # Add Titles & Comments
    if not p_title is None:
        fig_desc['ax0'].text(-0.2, 0.4, p_title, fontsize = 20, fontweight = 'bold', fontfamily = 'serif')
    if not p_comment is None:
        fig_desc['ax0'].text(-0.2, 0.3, p_comment, fontsize = 13, fontweight = 'light', fontfamily = 'serif')   
    plt.show()
    # End of function show_pair_plot_vs_target

def show_mutual_information(p_df: pd.core.frame.DataFrame, p_title:str = None, p_comment:str = None, p_seed: int = SEED_HARCODED_VALUE) -> None:
    print('----------------------------- show_mutual_information -----------------------------')
    X = p_df.copy() # Do not modify input parameter
    y = X.pop(TARGET_COLUMNS)
    for c in X.select_dtypes("object"): # Label encoding for categoricals
        X[c], _ = X[c].factorize()
    # Only integer-typed features (including the label-encoded categoricals) are discrete; float features stay continuous
    discrete_features = np.array([pd.api.types.is_integer_dtype(t) for t in X.dtypes])
    if OUTPUT_IS_REGRESSION:
        mi_scores = mutual_info_regression(X, y, discrete_features = discrete_features, random_state = p_seed)
    else:
        mi_scores = mutual_info_classif(X, y, discrete_features = discrete_features, random_state = p_seed)
    mi_scores = pd.Series(mi_scores, name = 'MI Scores', index = X.columns)
    mi_scores = mi_scores.sort_values(ascending = False)
    print('show_mutual_information: MI Scores: ')
    print(mi_scores)
    # Create the grid
    fig = plt.figure(figsize = (8, 8))
    gs = fig.add_gridspec(1, 1)
    background_color = "#fbfbfb"
    # Prepare the grid
    fig_desc = dict()
    fig_desc['ax0'] = fig.add_subplot(gs[0, 0])
    fig_desc['ax0'].set_facecolor(background_color)
    #fig_desc['ax0'].tick_params(axis = 'y', left=False)
    #fig_desc['ax0'].get_yaxis().set_visible(False)
    for s in ["top", "right", "left"]:
        fig_desc['ax0'].spines[s].set_visible(False)
    # Draw plots
    sns.barplot(x = mi_scores.values, y = mi_scores.index, orient = 'h', ax = fig_desc['ax0'])
    # Finalyze the figure
    # Add Titles & Comments
    if not p_title is None:
        fig_desc['ax0'].text(-0.2, 0.4, p_title, fontsize = 20, fontweight = 'bold', fontfamily = 'serif')
    if not p_comment is None:
        fig_desc['ax0'].text(-0.2, 0.3, p_comment, fontsize = 13, fontweight = 'light', fontfamily = 'serif')   
    plt.show()
    # End of function show_mutual_information

def cross_dataset_visualization(p_dfs: list, p_title:str = None, p_features:list = None, p_comment:str = None) -> None:
    """
    This method overlays, for each feature, its distribution in the first two datasets of the list (e.g. Train and Validation)
    """
    print('----------------------------- cross_dataset_visualization -----------------------------')
    if p_features is None:
        features = p_dfs[0].columns
    else:
        features = p_features
    # Create the grid
    fig, gs, fig_desc = create_grid(p_dfs[0][features])
    # Draw plots
    run_no = 0
    for feature in features:
        # Plot the feature values themselves, so that both distributions can be compared
        sns.histplot(x = p_dfs[0][feature], ax = fig_desc['ax' + str(run_no)], color='#ffd100')
        sns.histplot(x = p_dfs[1][feature], ax = fig_desc['ax' + str(run_no)], color='#ff0100')
        run_no += 1
    # End of 'for' statement
    # Finalyze the figure
    finalize_grid((fig, gs, fig_desc), p_dfs[0], features, p_title, p_comment)
    # End of function cross_dataset_visualization

The kaggle_visualization() function provides different plots to explore the data distribution (Gaussian, exponential...) and to detect outlier values. It will help 1) during the data cleaning and 2) later, when choosing the ML algorithms (e.g. tree-based algorithms are little affected by outliers).
There are two kinds of data visualization:
- The Univariate Plots, which describe each feature on its own, and
- The Multivariate Plots, which describe the interactions between features

The Univariate Plots:
- Histograms: A histogram is a graphical representation of the distribution of a feature. For a continuous numerical feature, it shows the underlying frequency distribution or the probability distribution of the values (see https://towardsdatascience.com/histograms-why-how-431a5cfbfcd5)
- Density: It is the continuous form of the histogram (see above) and it shows an estimate of the continuous distribution of a feature (Gaussian distribution, exponential distribution...)
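
To make the link between these two plots concrete, the sketch below computes a normalized histogram and a Gaussian kernel density estimate by hand with NumPy, on synthetic data. The sample, the number of bins and the bandwidth (a normal reference rule of thumb) are choices made for this illustration; they are not necessarily what seaborn does internally.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc = 5.0, scale = 2.0, size = 1000) # Synthetic feature

# Histogram: with density = True, the areas of the bars sum to 1
heights, edges = np.histogram(sample, bins = 20, density = True)
hist_area = np.sum(heights * np.diff(edges))

# Kernel density estimate: one Gaussian bump per observation, averaged
bandwidth = 1.06 * sample.std() * len(sample) ** (-1 / 5) # Rule-of-thumb bandwidth (assumption)
grid = np.linspace(sample.min() - 3 * bandwidth, sample.max() + 3 * bandwidth, 400)
z = (grid[:, None] - sample[None, :]) / bandwidth
density = np.exp(-0.5 * z ** 2).sum(axis = 1) / (len(sample) * bandwidth * np.sqrt(2 * np.pi))
kde_area = np.sum((density[:-1] + density[1:]) / 2 * np.diff(grid)) # Trapezoidal rule

print('Histogram area: %.3f' % hist_area)
print('KDE area: %.3f' % kde_area)
print('KDE mode near: %.2f' % grid[density.argmax()])
```

Both curves integrate to (approximately) 1, which is why the density plot can be read as a smoothed histogram.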

The Multivariate Plots:
- Correlation: It indicates how strongly two features vary together
- scatter_matrix: It plots each feature against every other one, to show the relationship between them
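
The two multivariate views are complementary. As a small sketch on synthetic data (the column names are invented for the example), the Pearson correlation detects a linear link but can miss a strong non-linear one, which a scatter plot would show immediately:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size = 500)
toy_df = pd.DataFrame({
    'x': x,
    'linear': 2 * x + rng.normal(0, 0.1, size = 500), # Linear dependence on x
    'squared': x ** 2,                                 # Non-linear dependence on x
})

corr = toy_df.corr(method = 'pearson')
print(corr.round(2))

# Uncomment to draw the corresponding scatter matrix (requires matplotlib):
# pd.plotting.scatter_matrix(toy_df, figsize = (8, 8))
```

Here 'squared' is entirely determined by 'x', yet its correlation with 'x' stays close to 0, so a low correlation alone is not a reason to discard a feature.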

In [None]:
# b.1) Data visualizations
def kaggle_visualization(p_df: pd.core.frame.DataFrame, p_seed: int = SEED_HARCODED_VALUE) -> None:
    """
    This method provides different views of the dataset (plot)
    :parameters p_df: The dataset handle
    :parameter p_seed: The seed value for reproducibility (TODO Not used)
    """
    print('----------------------------- kaggle_visualization -----------------------------')
    features = [col for col in p_df.columns if col != TARGET_COLUMNS] # TARGET_COLUMNS is a single column name: set() would split it into characters
    categorical_columns = [col for col in p_df.columns if p_df[col].dtype == 'object']
    if not DATE_TIME_COLUMNS is None:
        categorical_columns = list(set(categorical_columns) - set(DATE_TIME_COLUMNS))
    numerical_columns = [col for col in p_df.columns if p_df[col].dtype == 'int64' or p_df[col].dtype == 'float64']
    print('kaggle_visualization: Features Distribution plots')
    show_counts(p_df, p_hue = None)
    # Histogram plots
    print('kaggle_visualization: Numerical features Distribution plots')
    show_distributions(p_df, p_features = numerical_columns)
    if len(categorical_columns) != 0:
        print('kaggle_visualization: Categorical features scatter plots')
        show_scatter_plots(p_df, p_features = categorical_columns)
    else:
        print('kaggle_visualization: No categorical features to plots')
    print('kaggle_visualization: Features outliers plots')
    show_outliers(p_df, p_title = 'Features Distribution')
    #show_trends(p_df, p_title = 'Features Distribution', p_comment = 'All features have bimodal or multimodal distribution')
    print('kaggle_visualization: Features correlations heatmap')
    show_correlations(p_df)
    print('kaggle_visualization: Features VS target distribution plots')
    show_features_vs_target(p_df, p_target = TARGET_COLUMNS, p_features = numerical_columns)    
    print('kaggle_visualization: Pair plots VS target')
    show_pair_plot_vs_target(p_df, p_target = TARGET_COLUMNS)
    print('kaggle_visualization: Done')
    # End of function kaggle_visualization

In the case of images, we need to display them to understand the data and the problem to solve (e.g. flower classification...).
This is the role of the functions below.

In [None]:
# b.2) Images visualization
def kaggle_image_visualization(p_dataset, p_num_images: int = 10, p_prediction = None):
    """
    This method shows a number of images from p_dataset
    :parameter p_dataset: The dataset containing the images to show. Note that the dataset is usually already batched
    :parameter p_num_images: The number of images to show
    :parameter p_prediction: The predicted labels of the displayed images, if any. Default: None
    """
    print('----------------------------- kaggle_image_visualization -----------------------------')
    # Peek some data from the dataset
    it = iter(p_dataset.unbatch().batch(p_num_images))
    batch = next(it)
    kaggle_display_batch(batch, p_prediction)
    # End of function kaggle_image_visualization

def kaggle_batch_to_numpy(p_dataset):
    print('----------------------------- kaggle_batch_to_numpy -----------------------------')
    images, labels = p_dataset
    numpy_images = images.numpy()
    numpy_labels = labels.numpy()
    if numpy_labels.dtype == object: # binary string in this case,
                                     # these are image ID strings
        numpy_labels = [None for _ in enumerate(numpy_images)]
    # If no labels, only image IDs, return None for labels (this is
    # the case for test data)
    return numpy_images, numpy_labels
    # End of function kaggle_batch_to_numpy

def kaggle_title_from_label(p_label: str, p_correct_label: str):
    print('----------------------------- kaggle_title_from_label -----------------------------')
    if p_correct_label is None:
        return IMAGE_CLASSES[p_label], True
    correct = (p_label == p_correct_label)
    return "{} [{}{}{}]".format(IMAGE_CLASSES[p_label], 'OK' if correct else 'NO', u"\u2192" if not correct else '',
                                IMAGE_CLASSES[p_correct_label] if not correct else ''), correct
    # End of function kaggle_title_from_label

def kaggle_display_batch_image(p_image, p_title:str, subplot, red = False, titlesize = 16, p_cmap = 'viridis'):
    plt.subplot(*subplot)
    plt.axis('off')
    plt.imshow(p_image, cmap = p_cmap)
    if len(p_title) > 0:
        plt.title(p_title, fontsize=int(titlesize) if not red else int(titlesize/1.2), color='red' if red else 'black', fontdict={'verticalalignment':'center'}, pad=int(titlesize/1.5))
    return (subplot[0], subplot[1], subplot[2]+1)
    # End of function kaggle_display_batch_image
    
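def kaggle_display_batch(p_databatch, p_predictions = None):
    """
    This function displays the images of a batch in a square-ish grid
    :parameter p_databatch: The batch to display, as a tuple (images, labels)
    :parameter p_predictions: The predicted labels, used to flag wrong predictions in red. Default: None
    """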
    print('----------------------------- kaggle_display_batch -----------------------------')
    # Images
    images, labels = kaggle_batch_to_numpy(p_databatch)
    if labels is None: # Fill with None
        labels = [None for _ in enumerate(images)] 
    # Auto-squaring: this will drop data that does not fit into square or square-ish rectangle
    rows = int(math.sqrt(len(images)))
    cols = len(images)//rows
    # Sizing and spacing
    FIGSIZE = 13.0
    SPACING = 0.1
    subplot=(rows, cols, 1)
    if rows < cols:
        plt.figure(figsize=(FIGSIZE,FIGSIZE / cols * rows))
    else:
        plt.figure(figsize=(FIGSIZE / rows * cols,FIGSIZE))
    # Display
    for i, (image, label) in enumerate(zip(images[:rows * cols], labels[:rows * cols])):
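        # Comparing type(label) with the string 'int' is always True: test the actual integer types instead
        title = 'No label' if label is None or not isinstance(label, (int, np.integer)) else IMAGE_CLASSES[label]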
        correct = True
        if not p_predictions is None:
            title, correct = kaggle_title_from_label(p_predictions[i], label)
        dynamic_titlesize = FIGSIZE * SPACING / max(rows, cols) * 40 + 3 # magic formula tested to work from 1x1 to 10x10 images
        subplot = kaggle_display_batch_image(image, title, subplot, not correct, titlesize=dynamic_titlesize)
    # Layout
    plt.tight_layout()
    if label is None and p_predictions is None:
        plt.subplots_adjust(wspace=0, hspace=0)
    else:
        plt.subplots_adjust(wspace=SPACING, hspace=SPACING)
    plt.show()
    # End of function kaggle_display_batch

def kaggle_display_image_and_component(p_image, p_title:str = None) -> None:
    """
    This function displays an image and its RGB components separately
    :parameter p_image: The image to display (RGB format)
    :parameter p_title: The title of the display
    """
    # Extract RGB components
    pixels = img_to_array(p_image)
    red = pixels[:, :, 0]
    green = pixels[:, :, 1]
    blue = pixels[:, :, 2]
    # Sizing and spacing
    FIGSIZE = 13.0
    SPACING = 0.1
    rows = 1
    cols = 4
    subplot=(rows, cols, 1)
    if rows < cols:
        plt.figure(figsize=(FIGSIZE,FIGSIZE / cols * rows))
    else:
        plt.figure(figsize=(FIGSIZE / rows * cols,FIGSIZE))
    # Display image and its components
    images = (pixels, red, green, blue)
    titles = (p_title, 'red component', 'green component', 'blue component')
    for i, (image, title) in enumerate(zip(images, titles)):
        dynamic_titlesize = FIGSIZE * SPACING / max(rows, cols) * 40 + 3 # magic formula tested to work from 1x1 to 10x10 images
        subplot = kaggle_display_batch_image(image, title, subplot, True, titlesize = dynamic_titlesize, p_cmap = 'viridis' if i == 0 else 'gray')    
    # Layout
    plt.tight_layout()
    plt.subplots_adjust(wspace=0, hspace=0)
    plt.show()
    # End of function kaggle_display_image_and_component

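def kaggle_display_confusion_matrix(cmat, score, precision, recall) -> None:
    """
    This function displays a confusion matrix, annotated with the provided scores
    :parameter cmat: The confusion matrix (one row and one column per entry of IMAGE_CLASSES)
    :parameter score: The F1 score, or None to omit it
    :parameter precision: The precision, or None to omit it
    :parameter recall: The recall, or None to omit it
    """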
    print('----------------------------- kaggle_display_confusion_matrix -----------------------------')
    plt.figure(figsize=(15,15))
    ax = plt.gca()
    ax.matshow(cmat, cmap='Reds')
    ax.set_xticks(range(len(IMAGE_CLASSES)))
    ax.set_xticklabels(IMAGE_CLASSES, fontdict={'fontsize': 7})
    plt.setp(ax.get_xticklabels(), rotation=45, ha="left", rotation_mode="anchor")
    ax.set_yticks(range(len(IMAGE_CLASSES)))
    ax.set_yticklabels(IMAGE_CLASSES, fontdict={'fontsize': 7})
    plt.setp(ax.get_yticklabels(), rotation=45, ha="right", rotation_mode="anchor")
    titlestring = ""
    if score is not None:
        titlestring += 'f1 = {:.3f} '.format(score)
    if precision is not None:
        titlestring += '\nprecision = {:.3f} '.format(precision)
    if recall is not None:
        titlestring += '\nrecall = {:.3f} '.format(recall)
    if len(titlestring) > 0:
        ax.text(101, 1, titlestring, fontdict={'fontsize': 18, 'horizontalalignment':'right', 'verticalalignment':'top', 'color':'#804040'})
    plt.show()
    print('kaggle_display_confusion_matrix: Done')
    # End of function kaggle_display_confusion_matrix

def kaggle_display_learning_curves(training, validation, title, subplot) -> None:
    print('----------------------------- kaggle_display_learning_curves -----------------------------')
    if subplot % 10 == 1: # set up the subplots on the first call
        plt.subplots(figsize = (10,10), facecolor = '#F0F0F0')
        plt.tight_layout()
    ax = plt.subplot(subplot)
    ax.set_facecolor('#F8F8F8')
    ax.plot(training)
    ax.plot(validation)
    ax.set_title('model '+ title)
    ax.set_ylabel(title)
    #ax.set_ylim(0.28,1.05)
    ax.set_xlabel('epoch')
    ax.legend(['train', 'valid.'])
    plt.show()
    print('kaggle_display_learning_curves: Done')
    # End of function kaggle_display_learning_curves

Now we need to take a break in order to:
1. understand exactly what each column is
2. learn from the results we got


# Iris classification



# Titanic Disaster



# PIMA diabetes
Let's take a look at the number of '0' values in each column (see kaggle_summurize_data/Zeros per columns distribution and the feature distributions) and try to understand what they mean:
- What does a Glucose (plas) or a Blood pressure (pres) value of 0 mean? It is not physiologically possible!
- The same applies to Skin thickness (skin)

In this case, one solution is to replace '0' with a NaN value and let the imputation process do the job ;)

This is done in the kaggle_pre_features_engineering() function, where 0 values are replaced by NaN values for the following features: 'plas', 'pres', 'skin', 'test' and 'mass'
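
The replacement itself can be sketched as follows. This is not the code of kaggle_pre_features_engineering(), only an illustration on invented rows: the five feature names are the ones listed above, 'preg' is an assumed extra column for which 0 is a legitimate value, and the median imputation is just one possible choice.

```python
import numpy as np
import pandas as pd

# Invented rows using the PIMA column names quoted above ('preg' is an assumption)
toy_df = pd.DataFrame({
    'preg': [6, 1, 8, 0],
    'plas': [148, 0, 183, 89],
    'pres': [72, 66, 0, 66],
    'skin': [35, 29, 0, 23],
    'test': [0, 0, 0, 94],
    'mass': [33.6, 26.6, 0.0, 28.1],
})

# Features for which a value of 0 actually means 'not measured'
zero_means_missing = ['plas', 'pres', 'skin', 'test', 'mass']
cleaned_df = toy_df.copy() # Do not modify the input dataset
cleaned_df[zero_means_missing] = cleaned_df[zero_means_missing].replace(0, np.nan)
print(cleaned_df.isnull().sum())

# One simple imputation: fill each missing value with the median of its column
imputed_df = cleaned_df.fillna(cleaned_df.median())
print(imputed_df)
```

Note that 'preg' keeps its zeros: the replacement must only target the features where 0 is impossible.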

# Flower classification using TPU
Here the problem is to classify a photo of a flower in the right category (see IMAGE_CLASSES).
The datasets (training and validation) are in TensorFlow TFRecord (tfrec) format (see labeled_image_feature_description) and the test dataset (which plays the role of unseen data) is also in tfrec format (see unlabeled_image_feature_description).


*TODO: Add data analysis results for the other datasets (Iris, Melbourne Housing Prices)*


The function kaggle_ml_quick_and_dirty() provides a 'quick and dirty' evaluation based on a random forest (RandomForestClassifier for classification, RandomForestRegressor for regression) with the n_estimators parameter set to 128. All rows with NaN values are removed and all categorical attributes are excluded.

The idea is to use this function to establish a baseline by training a basic model at different steps of the feature engineering (un-augmented dataset, data cleanup...). A baseline score helps us decide whether each processing step is actually worthwhile.
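
The preparation applied before fitting the baseline model can be sketched with pandas alone. The dataset below is invented, 'species' is a hypothetical target column, and pd.factorize stands in for the LabelEncoder used by the function; keeping the numerical columns has the same effect here as excluding the 'object' ones.

```python
import numpy as np
import pandas as pd

target = 'species' # Hypothetical target column
toy_df = pd.DataFrame({
    'sepal_length': [5.1, 7.0, np.nan, 6.3, 5.8],
    'petal_length': [1.4, 4.7, 1.3, 6.0, 5.1],
    'origin': ['a', 'b', 'a', 'c', 'b'], # Categorical feature, will be excluded
    'species': ['setosa', 'versicolor', 'setosa', 'virginica', 'virginica'],
})

p = toy_df.dropna(axis = 0).copy()                    # 1. Remove the rows with NaN values
codes, classes = pd.factorize(p[target], sort = True) # 2. Encode the categorical target as integers
p[target] = codes
p = p.select_dtypes(include = 'number')               # 3. Keep the numerical columns only
Y = p[target]
X = p.drop([target], axis = 1)

print(X)
print(Y.tolist(), list(classes))
```

Running the same preparation after each feature engineering step, then fitting the same model, gives scores that can be compared with each other.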

In [None]:
# c.1) Basic ML for a quick & dirty evaluation
def kaggle_ml_quick_and_dirty(p_train_df: pd.core.frame.DataFrame, 
                              p_validation_df: pd.core.frame.DataFrame, 
                              p_test_df: pd.core.frame.DataFrame = None,
                              p_test_outputs_df: pd.core.frame.DataFrame = None,
                              p_seed:int = SEED_HARCODED_VALUE
                             ) -> np.ndarray:
    """
    This method provides a first ML evaluation based on the RandomForest algorithm
    :parameters p_train_df: The Training dataset (to fit the model)
    :parameters p_validation_df: The Validation dataset (to validate the model)
    :parameters p_test_df: The Test dataset (to do prediction on unseen data). Default: None
    :parameters p_test_outputs_df: The outputs for the Test dataset. It will be None in case of a Kaggle competition. Default: None
    :parameter p_seed: The seed value for reproducibility
    :return: The machine learning model  
    """
    print('----------------------------- kaggle_ml_quick_and_dirty -----------------------------')

    l = LabelEncoder()
    # Build training & validation datasets
    p = p_train_df.copy()
    # Remove NaN values
    p.dropna(axis = 0, inplace = True)
    # Encode the target when it is categorical
    if p[TARGET_COLUMNS].dtype == 'object':
        p[TARGET_COLUMNS] = l.fit_transform(p[TARGET_COLUMNS])
    # Ignore the remaining categorical values
    p = p.select_dtypes(exclude=['object'])
    Y_train = p[TARGET_COLUMNS]
    X_train = p.drop([TARGET_COLUMNS], axis = 1)
    print('X_train =>', X_train.head())
    print('Y_train =>', Y_train.head())
    print('type(Y_train) =>', type(Y_train))

    p = p_validation_df.copy()
    # Remove NaN values
    p.dropna(axis = 0, inplace = True)
    # Encode the target with the encoder already fitted on the training dataset (same label mapping)
    if p[TARGET_COLUMNS].dtype == 'object':
        p[TARGET_COLUMNS] = l.transform(p[TARGET_COLUMNS])
    # Ignore the remaining categorical values
    p = p.select_dtypes(exclude=['object'])
    Y_validation = p[TARGET_COLUMNS]
    X_validation = p.drop([TARGET_COLUMNS], axis = 1)
    print('X_validation =>', X_validation.head())
    print('Y_validation =>', Y_validation.head())
    print('type(Y_validation) =>', type(Y_validation))

    # Use classical model
    model = None
    if OUTPUT_IS_REGRESSION:
        model = ensemble.RandomForestRegressor(n_estimators = 128, max_depth = 16, max_features = 4, random_state = p_seed)
    else:
        model = ensemble.RandomForestClassifier(n_estimators = 128, max_depth = 16, max_features = 4, random_state = p_seed)
    # Train the model
    model.fit(X_train, Y_train.to_numpy().ravel())
    # Do predictions
    y_predictions = model.predict(X_validation)
    # Get scoring
    get_scoring(Y_validation, y_predictions)
    
    # Do prediction with unseen data
    if p_test_df is not None:
        p = p_test_df.copy()
        # Remove NaN values
        p.dropna(axis = 0, inplace = True) # FIXME Buggy: the number of rows of X_test can then differ from the number of rows of Y_test. Either impute or drop the same rows in both X_test & Y_test
        # Ignore categorical values
        p = p.select_dtypes(exclude=['object'])
        y_predictions = model.predict(p)
        # Evaluate the results when it is possible
        if p_test_outputs_df is not None:
            print('kaggle_ml_quick_and_dirty: p shape: ', p.shape)
            print('kaggle_ml_quick_and_dirty: p_test_outputs_df: %s' % str(p_test_outputs_df.shape))
            print('kaggle_ml_quick_and_dirty: y_predictions: %s' % str(y_predictions.shape))
            # Get scoring
            get_scoring(p_test_outputs_df, y_predictions)

    print('kaggle_ml_quick_and_dirty: Done')
    return model
    # End of function kaggle_ml_quick_and_dirty

def get_scoring(p_expected_validation: np.ndarray, p_predictions: np.ndarray) -> None:
    """
    Display the scores of the predictions
    :parameter p_expected_validation: The expected outputs
    :parameter p_predictions: The ML predictions
    """
    if OUTPUT_IS_REGRESSION:
        print('get_scoring: Model R2 score=%f' % (r2_score(p_expected_validation, p_predictions)))
        print('get_scoring: Model Mean absolute error regression loss (MAE): %0.4f' % mean_absolute_error(p_expected_validation, p_predictions))
        print('get_scoring: Model Mean squared error regression loss (MSE): %0.4f' % mean_squared_error(p_expected_validation, p_predictions))
        print('get_scoring: Model Root mean squared error regression loss (RMSE): %0.4f' % np.sqrt(mean_squared_error(p_expected_validation, p_predictions)))
    else:
        print('get_scoring: Model accuracy score: %0.4f' % accuracy_score(p_expected_validation, p_predictions))
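        # NOTE: roc_auc_score() and f1_score() are called with their default settings, which assume a binary target
        print('get_scoring: ROC AUC=%s' %(roc_auc_score(p_expected_validation, p_predictions)))
        print('get_scoring: Model F1 score=%f' % (f1_score(p_expected_validation, p_predictions)))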
        print('get_scoring: Confusion matrix: %s' % str(confusion_matrix(p_expected_validation, p_predictions)))
        print('get_scoring: Classification report:\n%s' % str(classification_report(p_expected_validation, p_predictions)))


The set of functions below is specific to Neural Network learning. It provides callbacks to create DL models and functions to visualize learning curves or learning rate curves (see point c.2 below).
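
To give an idea of the learning rate curve mentioned above, here is a minimal pure-Python sketch of a schedule with a linear ramp-up, an optional plateau and an exponential decay. The function name lr_schedule is hypothetical, the default values are the ones used by kaggle_build_learning_rate_schedule() in the cell below.

```python
def lr_schedule(epoch, start_lr = 0.0001, min_lr = 0.0001, max_lr = 0.0005,
                rampup_epochs = 5, sustain_epochs = 0, exp_decay = 0.8):
    """Return the learning rate to use for the given epoch (0-based)"""
    # Linear increase from start_lr to max_lr during rampup_epochs
    if epoch < rampup_epochs:
        return (max_lr - start_lr) / rampup_epochs * epoch + start_lr
    # Constant max_lr during sustain_epochs
    if epoch < rampup_epochs + sustain_epochs:
        return max_lr
    # Exponential decay towards min_lr
    return (max_lr - min_lr) * exp_decay ** (epoch - rampup_epochs - sustain_epochs) + min_lr

# Display the learning rate for a few epochs
for epoch in (0, 2, 5, 6, 20):
    print('epoch %2d => lr = %.6f' % (epoch, lr_schedule(epoch)))
```

With these defaults, the learning rate starts at 0.0001, reaches its maximum of 0.0005 at epoch 5, then decays back towards 0.0001.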

In [None]:
# Helper functions for kaggle_dl_quick_and_dirty()
def kaggle_build_pretrained_model(
                                  p_pretrained_layers:list, 
                                  p_image_size: list,
                                  p_channels: int = 3
                                  ) -> tf.keras.Sequential:
    """
    This function builds the pretrained models stacks for the Neural Network
    :parameter p_pretrained_layers: The pretrained models to be used
    :parameter p_image_size: The image size
    :parameter p_channels: The number of channels (e.g. 1 for grayscale, 3 for RGB)
    :return: The pretrained models stacks to build the final Neural Network
    """
    model = tf.keras.Sequential()
    for p in p_pretrained_layers:
        if p == 'InceptionV3': 
            # See https://github.com/fchollet/deep-learning-models/blob/ccd0eb24996b4cbff4231b90cd44b057c0b20f14/inception_v3.py
            pretrained_model = tf.keras.applications.InceptionV3(
                weights = 'imagenet',
                include_top = False,
                pooling = 'max',
                input_shape = [*p_image_size, p_channels]
            )
            pretrained_model.trainable = False
            model.add(pretrained_model)
        elif p == 'MobileNetV2':
            # See tf.keras.applications.MobileNetV2
            pretrained_model = tf.keras.applications.MobileNetV2(
                weights = 'imagenet',
                include_top = False,
                pooling = 'max',
                input_shape = [*p_image_size, p_channels]
            )
            pretrained_model.trainable = False
            model.add(pretrained_model)
        elif p == 'ResNet50':
            # See https://github.com/fchollet/deep-learning-models/blob/ccd0eb24996b4cbff4231b90cd44b057c0b20f14/resnet50.py
            pretrained_model = tf.keras.applications.ResNet50(
                weights = 'imagenet',
                include_top = False ,
                input_shape = [*p_image_size, p_channels]
            )
            pretrained_model.trainable = False
            model.add(pretrained_model)
        elif p == 'VGG16': 
            # See https://github.com/fchollet/deep-learning-models/blob/master/vgg16.py
            pretrained_model = tf.keras.applications.VGG16(
                weights = 'imagenet',
                include_top = False ,
                input_shape = [*p_image_size, p_channels]
            )
            pretrained_model.trainable = False
            model.add(pretrained_model)
        elif p == 'VGG19':
            # See https://github.com/fchollet/deep-learning-models/blob/master/vgg19.py
            pretrained_model = tf.keras.applications.VGG19(
                weights = 'imagenet',
                include_top = False,
                pooling = 'max',
                input_shape = [*p_image_size, p_channels]
            )
            pretrained_model.trainable = False
            model.add(pretrained_model)
        else:
            raise ValueError('kaggle_build_pretrained_model: Undefined pretrained model: %s' % str(p))
    # End of 'for' statement
    return model
    # End of function kaggle_build_pretrained_model

def kaggle_create_sequential_classifier_model(
                                              #p_strategy
                                              #p_input_shape: int = DL_INPUT_SHAPE,
                                              #p_drop_rate: float = DL_DROP_RATE,
                                              p_optimizer: str = 'adam', 
                                              p_loss: str = 'binary_crossentropy', 
                                              p_metrics: list = ['AUC', 'accuracy'],
                                              p_pretrained_layers:list = None,
                                              p_image_size: list = None,
                                              p_class_num: int = None
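                                             ) -> tf.keras.Sequential:
    """
    Build a Neural network model for classification
    :parameter p_optimizer: The optimizer. Default: 'adam'
    :parameter p_loss: The loss function. Default: 'binary_crossentropy'
    :parameter p_metrics: The metrics to monitor during the training. Default: ['AUC', 'accuracy']
    :parameter p_pretrained_layers: The pretrained models to be used (see kaggle_build_pretrained_model). Default: None
    :parameter p_image_size: The image size, required when p_pretrained_layers is set. Default: None
    :parameter p_class_num: The number of classes (currently unused, len(IMAGE_CLASSES) is used instead). Default: None
    :return: The compiled model
    """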
    print('----------------------------- kaggle_create_sequential_classifier_model -----------------------------')
    global DL_INPUT_SHAPE, DL_DROP_RATE

    model = None
    if p_pretrained_layers is None:
        model = tf.keras.Sequential([
                                    tf.keras.layers.BatchNormalization(input_shape = DL_INPUT_SHAPE),
                                    tf.keras.layers.Dense(32, activation = 'relu'),
                                    tf.keras.layers.Dropout(rate = DL_DROP_RATE),
                                    tf.keras.layers.BatchNormalization(),
                                    tf.keras.layers.Dense(64, activation = 'relu'),
                                    tf.keras.layers.BatchNormalization(),
                                    tf.keras.layers.Dropout(rate = DL_DROP_RATE),
                                    tf.keras.layers.Dense(1, activation = 'sigmoid'), # Binary classes
                                    ])
    else:
        model = kaggle_build_pretrained_model(p_pretrained_layers, p_image_size)
        model.add(tf.keras.layers.Dense(len(IMAGE_CLASSES), activation='softmax')) # Multi classes

    model.compile(optimizer=p_optimizer, loss = p_loss, metrics = p_metrics)

    print('kaggle_create_sequential_classifier_model: Model summary:')
    model.summary()
    #plot_model(model, to_file='model_plot.png', show_shapes=True, show_layer_names=True)
    return model
    # End of function kaggle_create_sequential_classifier_model

def kaggle_create_sequential_regressor_model(
                                             p_optimizer:str = 'adam', 
                                             p_loss:str = 'mae', 
                                             p_metrics:list = ['mae'],
                                             p_pretrained_layers: list = None,
                                             p_image_size: list = None,
                                             p_class_num: int = None
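                                            ) -> tf.keras.Sequential:
    """
    Build a Neural network model for regression
    :parameter p_optimizer: The optimizer. Default: 'adam'
    :parameter p_loss: The loss function. Default: 'mae'
    :parameter p_metrics: The metrics to monitor during the training. Default: ['mae']
    :parameter p_pretrained_layers: The pretrained models to be used (see kaggle_build_pretrained_model). Default: None
    :parameter p_image_size: The image size, required when p_pretrained_layers is set. Default: None
    :parameter p_class_num: The number of outputs, required when p_pretrained_layers is set. Default: None
    :return: The compiled model
    """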
    print('----------------------------- kaggle_create_sequential_regressor_model -----------------------------')
    global DL_INPUT_SHAPE, DL_DROP_RATE

    model = None
    if p_pretrained_layers is None:
        model = tf.keras.Sequential([
                                    tf.keras.layers.BatchNormalization(input_shape = DL_INPUT_SHAPE),
                                    tf.keras.layers.Dense(32, activation = 'relu'),
                                    tf.keras.layers.Dropout(rate = DL_DROP_RATE),
                                    tf.keras.layers.BatchNormalization(),
                                    tf.keras.layers.Dense(64, activation = 'relu'),
                                    tf.keras.layers.BatchNormalization(),
                                    tf.keras.layers.Dropout(rate = DL_DROP_RATE),
                                    tf.keras.layers.Dense(1, activation = 'relu'),
                                    ])
    else:
        model = kaggle_build_pretrained_model(p_pretrained_layers, p_image_size)
        model.add(tf.keras.layers.GlobalAveragePooling2D())
        model.add(tf.keras.layers.Dropout(DL_DROP_RATE))
        model.add(tf.keras.layers.Dense(p_class_num, activation='relu'))

    model.compile(optimizer=p_optimizer, loss = p_loss, metrics = p_metrics)

    print('kaggle_create_sequential_regressor_model: Model summary:')
    model.summary()
    return model
    # End of function kaggle_create_sequential_regressor_model

def kaggle_build_learning_rate_schedule(epoch,
                                        start_lr = 0.0001, 
                                        min_lr = 0.0001, 
                                        max_lr = 0.0005,
                                        rampup_epochs = 5, 
                                        sustain_epochs = 0,
                                        exp_decay = 0.8
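                                       ):
    """
    Compute the learning rate to use for the given epoch
    :parameter epoch: The current epoch
    :parameter start_lr: The learning rate at epoch 0. Default: 0.0001
    :parameter min_lr: The lower bound reached by the exponential decay. Default: 0.0001
    :parameter max_lr: The learning rate at the end of the ramp-up. Default: 0.0005
    :parameter rampup_epochs: The number of epochs of the linear ramp-up. Default: 5
    :parameter sustain_epochs: The number of epochs where max_lr is kept constant. Default: 0
    :parameter exp_decay: The exponential decay factor. Default: 0.8
    :return: The learning rate for this epoch
    """
    def _exponential_lr(epoch, start_lr, min_lr, max_lr, rampup_epochs, sustain_epochs, exp_decay):
        """
        Linear ramp-up, optional plateau, then exponential decay
        """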
        # linear increase from start to rampup_epochs
        if epoch < rampup_epochs:
            lr = ((max_lr - start_lr) /
                  rampup_epochs * epoch + start_lr)
        # constant max_lr during sustain_epochs
        elif epoch < rampup_epochs + sustain_epochs:
            lr = max_lr
        # exponential decay towards min_lr
        else:
            lr = ((max_lr - min_lr) *
                  exp_decay**(epoch - rampup_epochs - sustain_epochs) +
                  min_lr)
        return lr
    l = _exponential_lr(epoch, start_lr, min_lr, max_lr, rampup_epochs, sustain_epochs, exp_decay)
    return l
    # End of function kaggle_build_learning_rate_schedule

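def kaggle_display_learning_rate_curve(p_lr):
    """
    Plot the learning rate returned by the schedule for each epoch
    :parameter p_lr: The learning rate schedule, a function taking the epoch and returning the learning rate
    """
    print('----------------------------- kaggle_display_learning_rate_curve -----------------------------')
    rng = list(range(DL_EPOCH_NUM))
    y = [p_lr(x) for x in rng]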
    plt.plot(rng, y)
    plt.show()
    print("kaggle_display_learning_rate_curve: Done: Schedule: {:.3g} to {:.3g} to {:.3g}".format(y[0], max(y), y[-1]))
    # End of function kaggle_display_learning_rate_curve

The function kaggle_dl_quick_and_dirty() provides a 'quick and dirty' evaluation of a DL model based on a basic Neural Network. All rows with NaN values are removed and all categorical attributes are excluded.
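
The clean-up described above (remove the rows with NaN values, encode the target, exclude the categorical attributes) can be sketched as follows. The helper quick_and_dirty_split and the sample data are hypothetical, they only illustrate the steps and are not the notebook code itself.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def quick_and_dirty_split(p_df, p_target, p_encoder, p_fit = True):
    """Return (X, Y) after dropping rows with NaN, encoding the target and excluding categorical columns"""
    # Remove rows with NaN values
    p = p_df.dropna(axis = 0).copy()
    # Encode the target when it is categorical (fit on the training dataset only)
    if p[p_target].dtype == 'object':
        if p_fit:
            p[p_target] = p_encoder.fit_transform(p[p_target])
        else:
            p[p_target] = p_encoder.transform(p[p_target])
    # Ignore the remaining categorical columns
    p = p.select_dtypes(exclude = ['object'])
    return p.drop([p_target], axis = 1), p[p_target]

# Hypothetical samples, with explicit 'object' dtype for the categorical columns
train_df = pd.DataFrame({
    'sepal_length': [5.1, 7.0, np.nan, 6.3],
    'color': pd.Series(['red', 'blue', 'red', 'blue'], dtype = 'object'),
    'species': pd.Series(['setosa', 'versicolor', 'setosa', 'virginica'], dtype = 'object'),
})
validation_df = pd.DataFrame({
    'sepal_length': [5.8, 4.9],
    'color': pd.Series(['blue', 'red'], dtype = 'object'),
    'species': pd.Series(['virginica', 'setosa'], dtype = 'object'),
})

encoder = LabelEncoder()
X_train, Y_train = quick_and_dirty_split(train_df, 'species', encoder, p_fit = True)
X_validation, Y_validation = quick_and_dirty_split(validation_df, 'species', encoder, p_fit = False)
print(X_train)
print(Y_train.tolist(), Y_validation.tolist())
```

Fitting the encoder on the training dataset only keeps the same label mapping for the validation dataset.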

In [None]:
# c.2) Basic DL for a quick & dirty evaluation
def kaggle_dl_quick_and_dirty(p_train_df: pd.core.frame.DataFrame, 
                              p_validation_df: pd.core.frame.DataFrame, 
                              p_strategy,
                              p_test_df: pd.core.frame.DataFrame = None,
                              p_test_outputs_df: pd.core.frame.DataFrame = None,
                              p_seed:int = SEED_HARCODED_VALUE,
                             ):
    """
    This method provides a first DL evaluation based on a basic Neural Network
    :parameter p_train_df: The Training dataset (to fit the model)
    :parameter p_validation_df: The Validation dataset (to validate the model)
    :parameter p_strategy: The TensorFlow distribution strategy used to build and train the model
    :parameter p_test_df: The Test dataset (to do prediction). Default: None
    :parameter p_test_outputs_df: The outputs for the Test dataset. It will be None in case of a Kaggle competition. Default: None
    :parameter p_seed: The seed value for reproducibility
    :return: The trained deep learning model
    """

    print('----------------------------- kaggle_dl_quick_and_dirty -----------------------------')
    global DL_INPUT_SHAPE, DL_DROP_RATE

    l = LabelEncoder()
    # Build training & validation datasets
    p = p_train_df.copy()
    # Remove NaN values
    p.dropna(axis = 0, inplace = True)
    # Encode the target when it is categorical
    if p[TARGET_COLUMNS].dtype == 'object':
        p[TARGET_COLUMNS] = l.fit_transform(p[TARGET_COLUMNS])
    # Ignore the remaining categorical values
    p = p.select_dtypes(exclude=['object'])
    Y_train = p[TARGET_COLUMNS]
    X_train = p.drop([TARGET_COLUMNS], axis = 1)

    p = p_validation_df.copy()
    # Remove NaN values
    p.dropna(axis = 0, inplace = True)
    # Encode the target with the encoder already fitted on the training dataset (same label mapping)
    if p[TARGET_COLUMNS].dtype == 'object':
        p[TARGET_COLUMNS] = l.transform(p[TARGET_COLUMNS])
    # Ignore the remaining categorical values
    p = p.select_dtypes(exclude=['object'])
    Y_validation = p[TARGET_COLUMNS]
    X_validation = p.drop([TARGET_COLUMNS], axis = 1)

    # Build the Neural Network model
    #   Use global variables instead of parameters due to the following Keras issue:
    #   KerasClassifier with build_fn a class does not work with sklearn.model_selection.cross_val_score #13717
    #   See https://github.com/keras-team/keras/issues/13717
    DL_INPUT_SHAPE = [X_train.shape[1]]
    with p_strategy.scope():
        if OUTPUT_IS_REGRESSION:
            model = KerasRegressor(build_fn = kaggle_create_sequential_regressor_model)#p_strategy, p_input_shape = [X_train.shape[1]])
        else:
            model = KerasClassifier(build_fn = kaggle_create_sequential_classifier_model)#p_strategy, p_input_shape = [X_train.shape[1]])
    with p_strategy.scope():
        # Train the model
        early_stopping = EarlyStopping(
            min_delta = 0.001, # minimum amount of change to count as an improvement
            patience = 20, # how many epochs to wait before stopping
            restore_best_weights = True,
        )
        history = model.fit(X_train, 
                            Y_train, 
                            validation_data=(X_validation, Y_validation), 
                            epochs = DL_EPOCH_NUM, 
                            batch_size = DL_BATCH_SIZE * p_strategy.num_replicas_in_sync, 
                            callbacks = [early_stopping],
                            verbose = 0
                           )
    # Draw Loss and AUC
    print('kaggle_dl_quick_and_dirty: Draw Loss and AUC')
    history_df = pd.DataFrame(history.history)
    print(history_df.head())
    print("kaggle_dl_quick_and_dirty: Minimum validation loss: {}".format(history_df['val_loss'].min()))
    kaggle_display_learning_curves(
        history.history['loss'],
        history.history['val_loss'],
        'loss',
        211,
    )
    if not OUTPUT_IS_REGRESSION: # The 'accuracy' and 'AUC' metrics are only monitored by the classifier model
        print("kaggle_dl_quick_and_dirty: Maximum validation Accuracy: {}".format(history_df['val_accuracy'].max()))
        kaggle_display_learning_curves(
            history.history['auc'],
            history.history['val_auc'],
            'auc',
            212,
        )
        print("kaggle_dl_quick_and_dirty: Maximum validation AUC: {}".format(history_df['val_auc'].max()))
    # Do predictions
    print('kaggle_dl_quick_and_dirty: Do predictions')
    y_predictions = model.predict(X_validation)
    # Get scoring
    get_scoring(Y_validation, y_predictions)

    # Do prediction with unseen data
    if p_test_df is not None:
        p = p_test_df.copy()
        # Remove NaN values
        p.dropna(axis = 0, inplace = True) # Caution: p_test_outputs_df must then contain the same rows
        # Ignore categorical values
        p = p.select_dtypes(exclude=['object'])
        y_predictions = model.predict(p)
        # Evaluate the results when it is possible
        if p_test_outputs_df is not None:
            # Get scoring
            get_scoring(p_test_outputs_df, y_predictions)

    print('kaggle_dl_quick_and_dirty: Done')
    return model
    # End of function kaggle_dl_quick_and_dirty

So, the next step is to prepare the data for ML. Usually, you get better results when all the columns (features and outputs) are in numerical format (int or float).

1. Feature engineering. It eliminates NULL or NaN values and duplicate values, and it transforms date/time columns and categorical columns into numerical features. It also identifies & handles outliers... (3.a). Categorical columns are usually of type object and the objective here is to transform them into numerical columns. Date/time columns can be either of type object (e.g. date/time in string format) or of type datetime64[ns]. For specific features such as 'Age', it is possible to create a new feature grouping the Age values per range, from the lowest Age value to the highest one
2. Data transformation. It applies some numerical transformations such as standardization of features... (3.b)
3. Features selection. It selects and prepares the dataset for the training and the validation (3.c)
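
Some of the feature engineering steps listed above can be sketched on a tiny example. The sample data, the column names and the Age ranges are hypothetical, and pd.get_dummies() is used here for brevity instead of the OneHotEncoder used by the notebook.

```python
import pandas as pd

# Hypothetical sample with a date/time column in string format, an 'Age' column and a categorical column
sample_df = pd.DataFrame({
    'Date': ['2017-03-04', '2016-12-24', '2018-01-15'],
    'Age': [5, 42, 77],
    'Type': pd.Series(['h', 'u', 'h'], dtype = 'object'),
})
p = sample_df.copy()

# Date/time column: string => datetime64[ns] => int64 (nanoseconds since 1970-01-01)
p['Date'] = pd.to_datetime(p['Date']).astype('datetime64[ns]').astype('int64')

# New feature grouping the Age values per range: (0, 18] => 0, (18, 65] => 1, (65, 120] => 2
p['Age_range'] = pd.cut(p['Age'], bins = [0, 18, 65, 120], labels = False)

# Categorical column with a low cardinality: one-hot encoding
p = pd.get_dummies(p, columns = ['Type'], dtype = 'int64')

print(p.dtypes)
print(p.head())
```

At the end, all the columns are numerical, which is the objective of this step.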

In [None]:
# 3. Prepare the dataset for your Machine Learning processing
# a) Data Cleaning
Encoders = dict()
Encoder_Instance = LabelEncoder() # Use global variable for future reverse features engineering
Imputer_Instance = None
def kaggle_features_engineering(p_df: pd.core.frame.DataFrame,                               
                                p_missing_value_method: str = 'drop_columns', 
                                p_duplicated_value_method: str = 'drop_columns', 
                                p_categorical_onehot_threshold: int = 10, 
                                p_date_time_columns: list = None, 
                                p_date_time_engineering: str = 'python_time',
                                p_outliers_lower_percentile = 25,
                                p_outliers_upper_percentile = 75,
                                p_outliers_impute_method = 'mean'
                               ) -> pd.core.frame.DataFrame:
    """
    This function cleans up the dataset (null values, duplicate values, date/time and categorical columns, outliers) based on the specified methods
    :parameter p_df: The dataset handle
    :parameter p_missing_value_method: The method to cleanup NaN values in the dataset ('drop_columns', 'drop_lines', 'mean', 'median'). Default: 'drop_columns'
    :parameter p_duplicated_value_method: The method to cleanup duplicated values in the dataset ('drop_columns', 'drop_lines', 'mean', 'median'). Default: 'drop_columns'
    :parameter p_categorical_onehot_threshold: The maximum cardinality to apply OneHotEncoder to a categorical variable. Default: 10
    :parameter p_date_time_columns: The list of the date/time columns to convert. Default: None
    :parameter p_date_time_engineering: The method to convert Date/Time. Default: 'python_time'
    :parameter p_outliers_lower_percentile: Percentile threshold for the Q1 (lower bound). Default: 25
    :parameter p_outliers_upper_percentile: Percentile threshold for the Q3 (upper bound). Default: 75
    :parameter p_outliers_impute_method: Method to use to impute outliers ('mean' or 'median'). Default: 'mean'
    :return: The dataset after the cleanup process
    """
    global Encoders, Encoder_Instance, Imputer_Instance
    
    print('----------------------------- kaggle_features_engineering -----------------------------')
    # Cleanup dataset
    old_shape = p_df.shape
    p = p_df.copy() # The final dataset

    # Apply the feature processing resulting from the data analysis
    p = kaggle_pre_features_engineering(p)

    # Build the list of categorical and numerical features
    categorical_columns = [col for col in p.columns if p[col].dtype == 'object']
    numerical_columns = [col for col in p.columns if p[col].dtype == 'int64' or p[col].dtype == 'float64']
    print('kaggle_features_engineering: Initial categorical_columns:')
    print(categorical_columns)
    print('kaggle_features_engineering: Initial numerical_columns:')
    print(numerical_columns)
    if p_date_time_columns is not None:
        if len(categorical_columns) != 0:
            categorical_columns = list(set(categorical_columns) - set(p_date_time_columns))
        numerical_columns = list(set(numerical_columns) - set(p_date_time_columns))

    # Convert Date/time columns
    # dtype = 'datetime64[ns]'
    print('----------------------------- kaggle_features_engineering: Processing Date/Time columns')
    if p_date_time_columns is not None: # Process specified columns
        # Check date/time formats
        for column in p_date_time_columns: # TODO Check if all DateTime values have the same format, i.e. same length
            date_lengths = p[column].str.len().value_counts()
            print('kaggle_features_engineering: %s lengths:' % column)
            print('%s - %d' % (str(date_lengths), len(date_lengths)))
        # End of 'for' statement
        p[p_date_time_columns] = p[p_date_time_columns].astype('datetime64[ns]')
        p[p_date_time_columns] = p[p_date_time_columns].astype('int64')
        print('kaggle_features_engineering: Date/time columns processed')
    else:
        print('kaggle_features_engineering: No date/time values')
    # Be sure there is no more 'datetime64[ns]' types in the dataset
    datetime_columns = [col for col in p.columns if p[col].dtype == 'datetime64[ns]']
    if len(datetime_columns) != 0:
        raise Exception('kaggle_features_engineering: There are still datetime64[ns] columns in the dataset', 'columns=%s' % str(datetime_columns))

    # Find N/A values for categorical columns and replace them by the most frequent value
    print('----------------------------- kaggle_features_engineering: Processing NaN values')
    categorical_columns_with_nan = [col for col in p.columns if p[col].dtype == 'object' and p[col].isna().sum() != 0]
    if len(categorical_columns_with_nan) != 0:
        print('----------------------------- kaggle_features_engineering: Impute NaN values for categorical columns with the most frequent value')
        for col in categorical_columns_with_nan:
            p[col] = p[col].fillna(p[col].value_counts().idxmax())
        # End of 'for' statement
        # Check that there are no more categorical columns with NaN
        categorical_columns_with_nan = [col for col in p.columns if p[col].dtype == 'object' and p[col].isna().sum() != 0]
        if len(categorical_columns_with_nan) != 0:
            raise Exception('kaggle_features_engineering: There are still categorical columns with NaN values in the dataset', 'columns=%s' % str(categorical_columns_with_nan))
    else:
        print('----------------------------- kaggle_features_engineering: No NaN value in categorical columns')
    # Use Imputation to replace NaN in numerical columns
    print('----------------------------- kaggle_features_engineering: Impute NaN values for numerical columns with %s method' % p_missing_value_method)
    numerical_columns_with_nan = [col for col in p.columns if (p[col].dtype == 'int64' or p[col].dtype == 'float64') and p[col].isna().sum() != 0]
    if len(numerical_columns_with_nan) != 0:
        print('kaggle_features_engineering: cols_with_missing: %s' % (str(numerical_columns_with_nan)))
        # Find rows with missing values
        rows_with_null = p[numerical_columns_with_nan].isnull().any(axis=1)
        rows_with_missing = p[rows_with_null]
        print('kaggle_features_engineering: rows_with_missing: %s/%s' % (rows_with_missing.shape[0], p.shape[0]))
        # Impute missing values
        if p_missing_value_method == 'drop_columns': # Remove the columns with missing values
            p = p.drop(numerical_columns_with_nan, axis = 1)
        elif p_missing_value_method == 'drop_lines' and len(rows_with_null) != 0: # Remove the rows with missing values
            p = p.dropna()
        else: # Impute using SimpleImputer
            if Imputer_Instance is None:
                if p_missing_value_method == 'mean':
                    Imputer_Instance = SimpleImputer(strategy='mean')
                elif p_missing_value_method == 'median':
                    Imputer_Instance = SimpleImputer(strategy='median')
                else:
                    raise Exception('kaggle_features_engineering: Invalid method', 'method=%s' % (p_missing_value_method))
            # else, nothing to do
            labels = p.columns # Save labels
            Imputer_Instance = Imputer_Instance.fit(p[numerical_columns_with_nan])
            p[numerical_columns_with_nan] = Imputer_Instance.transform(p[numerical_columns_with_nan])
            #p[numerical_columns_with_nan] = pd.DataFrame(Imputer_Instance.fit_transform(p))
            # Restore column names
            p.columns = labels
            print('kaggle_features_engineering: Cleaning NaN values: old_shape: %s / new shape: %s' % (str(old_shape), str(p.shape)))
            print(p.head())
            numerical_columns_with_nan = [col for col in p.columns if (p[col].dtype == 'int64' or p[col].dtype == 'float64') and p[col].isna().sum() != 0]
            if len(numerical_columns_with_nan) != 0:
                raise Exception('kaggle_features_engineering: There are still numerical columns with NaN values in the dataset', 'columns=%s' % str(numerical_columns_with_nan))
    else:
        print('kaggle_features_engineering: No missing values in numerical columns')
    print('----------------------------- kaggle_features_engineering: After First round:')
    #print(p.head())
    print(p.describe().T)

    print('----------------------------- kaggle_features_engineering: Rebuild lists of categorical and numerical columns')
    # FIXME Code seems to be useless
    categorical_columns = [col for col in p.columns if p[col].dtype == 'object']
    numerical_columns = [col for col in p.columns if p[col].dtype == 'int64' or p[col].dtype == 'float64']
    print('kaggle_features_engineering: Rebuilt categorical_columns:')
    print(categorical_columns)
    print('kaggle_features_engineering: Rebuilt numerical_columns:')
    print(numerical_columns)

    # Search for categorical variables
    print('----------------------------- kaggle_features_engineering: Encoding categorical columns:')
    new_categorical_columns = []
    if len(categorical_columns) != 0:
        print('kaggle_features_engineering: categorical_columns: ' + str(categorical_columns))
        # Compute cardinalities of the categorical variables
        categorical_columns_cardinalities = list(map(lambda col: p[col].nunique(), categorical_columns))
        print('kaggle_features_engineering: categorical_columns_cardinalities: ')
        print(categorical_columns_cardinalities)
        print('kaggle_features_engineering: OneHotEncoder thresholds: %d' % p_categorical_onehot_threshold)
        # Apply OneHot encoding to categorical value with very low cardinality
        cols_processed = []
        new_categorical_columns = categorical_columns.copy()
        for i in range(0, len(categorical_columns)):
            if categorical_columns_cardinalities[i] <= p_categorical_onehot_threshold:
                print('kaggle_features_engineering: OneHotEncoder: %s' % categorical_columns[i])
                # NOTE: With scikit-learn >= 1.2, the 'sparse' parameter is named 'sparse_output'
                if Encoders is None:
                    Encoders = dict()
                if categorical_columns[i] not in Encoders:
                    Encoders[categorical_columns[i]] = OneHotEncoder(handle_unknown = 'ignore', sparse = False)
                new_col = Encoders[categorical_columns[i]].fit_transform(pd.DataFrame(p[categorical_columns[i]]))
                new_col = pd.DataFrame(new_col, columns = [(categorical_columns[i] + "_" + str(j)) for j in range(new_col.shape[1])])
                new_col.index = p[categorical_columns[i]].index
                p.drop(categorical_columns[i], inplace = True, axis = 1)
                p = pd.concat((p, new_col), axis = 1)
                cols_processed.append(categorical_columns[i])
                # Update the list of the categorical columns
                new_categorical_columns.remove(categorical_columns[i])
                new_categorical_columns.extend(new_col.columns.tolist())
            else:
                print('!!!!!!!!!!!!!!!!!!!! kaggle_features_engineering: Cannot process %s' % categorical_columns[i])
                # Just drop them for the time being
                # FIXME To be refined using TargetEncoder
                p.drop(categorical_columns[i], axis = 1, inplace = True)
                # Update the list of the categorical columns
                new_categorical_columns.remove(categorical_columns[i])
        # End of 'for' statement
        if len(cols_processed) != 0:
            print('kaggle_features_engineering: Encoders applied on %s' % str(cols_processed))
            print('kaggle_features_engineering: New dataset structure:')
            #print(p.head())
            print(p.describe().T)
            categorical_columns = [col for col in p.columns if p[col].dtype == 'object']
            print('kaggle_features_engineering: Cleaning categorical values: old_shape: %s / new shape: %s' % (str(old_shape), str(p.shape)))
            print('kaggle_features_engineering: new Categorical columns:')
            print(categorical_columns)
            # Compute new cardinalities of the categorical variables
            categorical_columns_cardinalities = list(map(lambda col: p[col].nunique(), categorical_columns))
            print('kaggle_features_engineering: New categorical_columns_cardinalities: ')
            print(categorical_columns_cardinalities)
        # TODO: Drop categorical variables with extreme cardinalities
        # Encode categorical variables using numerical mapping
        for col in categorical_columns:
            p[col] = Encoder_Instance.fit_transform(p[col].astype(str))
            # End of 'for' statement
            print('kaggle_features_engineering: Labelling:')
            #print(p.head())
            print(p.describe().T)
        # End of 'for' statement
    else:
        print('kaggle_features_engineering: No categorical values')
    # Be sure there is no more 'object' types in the dataset
    categorical_columns = [col for col in p.columns if p[col].dtype == 'object']
    if len(categorical_columns) != 0:
        raise Exception('kaggle_features_engineering: There are still object-typed columns in the dataset', 'columns=%s' % str(categorical_columns))
    print('----------------------------- kaggle_features_engineering: After Second round:')
    #print(p.head())
    print(p.describe().T)

    # Identifying & handling outliers
    print('----------------------------- kaggle_features_engineering: Identifying & handling outliers:')
    for c in numerical_columns:
        if EXCLUDE_FROM_OULIERS is None or not c in EXCLUDE_FROM_OULIERS:
            q_lower, q_upper = np.percentile(p[c], p_outliers_lower_percentile), np.percentile(p[c], p_outliers_upper_percentile)
            iqr = q_upper - q_lower
            print('kaggle_features_engineering: IQR range %f' % iqr)
            # Outlier cutoff threshold
            cut_off = iqr * 1.5
            if not cut_off == 0:
                lower_bound, upper_bound = q_lower - cut_off, q_upper + cut_off
                print('kaggle_features_engineering: Outliers thresholds for %s: (%f, %f)' % (c, lower_bound, upper_bound))
                outliers = [x for x in p[c] if x < lower_bound or x > upper_bound]
                if p_outliers_impute_method == 'mean':
                    outliers_impute_method = p[c].mean()
                else:
                    outliers_impute_method = p[c].median()
                if len(outliers) != 0:
                    print('kaggle_features_engineering: Outliers for %s' % c)
                    print(outliers)
                    p[c] = p[c].apply(lambda x: outliers_impute_method if x < lower_bound or x > upper_bound else x)
                else:
                    print('No outliers for %s' % c)
            else:
                print('No outliers for %s' % c)
        else:
            print('kaggle_features_engineering: %s is excluded from Outliers' % c)
    print('----------------------------- kaggle_features_engineering: After Third round:')
    #print(p.head())
    print(p.describe().T)
    
    p = kaggle_post_features_engineering(p)

    # Rebuild the list of categorical and numerical features
    categorical_columns = [col for col in p.columns if p[col].dtype == 'object']
    numerical_columns = [col for col in p.columns if p[col].dtype == 'int64' or p[col].dtype == 'float64']
    if p_date_time_columns is not None:
        if len(categorical_columns) != 0:
            categorical_columns = list(set(categorical_columns) - set(p_date_time_columns))
        numerical_columns = list(set(numerical_columns) - set(p_date_time_columns))
    print('----------------------------- kaggle_features_engineering: Categorical columns after: ', categorical_columns)
    print('----------------------------- kaggle_features_engineering: Numerical columns after: ', numerical_columns)

    print('kaggle_features_engineering: ', list(new_categorical_columns)) 
    print('kaggle_features_engineering: Done') 
    return p, new_categorical_columns, numerical_columns
    # End of function kaggle_features_engineering

TODO: Add data engineering results
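
To illustrate the outliers handling done in kaggle_features_engineering(), here is a minimal standalone sketch of the same idea: compute the inter-quartile range, derive the bounds with a 1.5 factor and replace the values outside the bounds by the median (or the mean). The helper name `impute_outliers_iqr`, its default percentiles (25/75) and the sample values are my assumptions for this example, they are not part of the notebook.

```python
import numpy as np
import pandas as pd

def impute_outliers_iqr(values: pd.Series, lower_pct: float = 25, upper_pct: float = 75, method: str = 'median') -> pd.Series:
    """Hypothetical helper: replace the values outside the IQR-based bounds by the mean or the median."""
    q_lower, q_upper = np.percentile(values, [lower_pct, upper_pct])
    cut_off = 1.5 * (q_upper - q_lower)  # Outlier cutoff threshold
    if cut_off == 0:
        return values.copy()  # Constant-like column: nothing to do
    lower_bound, upper_bound = q_lower - cut_off, q_upper + cut_off
    fill_value = values.mean() if method == 'mean' else values.median()
    is_outlier = (values < lower_bound) | (values > upper_bound)
    return values.mask(is_outlier, fill_value)

# Made-up sample: 500.0 is far away from the other values
prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 500.0])
cleaned = impute_outliers_iqr(prices)
print(cleaned.tolist())
```

Note that, like in the function above, the median is computed on the whole column, outliers included.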

There are different kinds of data transformation:
- Standardization: It removes the mean and scales the feature to unit variance (see point 2.a)
- Scaling: It rescales the feature values to the range [0, 1]
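
The small sketch below shows the difference between the two transformations on a toy column. It assumes scikit-learn's `StandardScaler` and `MinMaxScaler`, and the values are made up. The important point is that the transformer is fitted on the training data only and then reused as-is on the validation data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # Made-up training column
valid = np.array([[3.0], [6.0]])                       # Made-up validation column

# Standardization: fit on the training data, reuse the fitted transformer on validation data
standard = StandardScaler().fit(train)
standardized_train = standard.transform(train)
standardized_valid = standard.transform(valid)

# Scaling: the training values are mapped to the range [0, 1]
minmax = MinMaxScaler().fit(train)
scaled_train = minmax.transform(train)
scaled_valid = minmax.transform(valid)  # Values unseen during fit can fall outside [0, 1]

print(standardized_train.ravel(), standardized_valid.ravel())
print(scaled_train.ravel(), scaled_valid.ravel())
```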

In [None]:
# b) Data Transforms
Transform = None
def kaggle_data_transform(p_df: pd.core.frame.DataFrame, p_columns:list = None, p_transform: str = 'standard') -> pd.core.frame.DataFrame:
    """
    Apply data transformation to the provided dataset
    :parameter p_df: The dataset handle
    :parameter p_columns: The columns to transform (e.g. we don't apply the transformation to categorical columns). Default: None, i.e. the whole dataset
    :parameter p_transform: The type of transformation. Default: 'standard'
                            'standard': Remove the mean and scale to unit variance
                            'scale': Scale each feature to the range [0, 1] (min/max scaling)
                            'abs_scale': Scale each feature to the range [-1, 1] by its maximum absolute value
    :return: The dataset after transformation
    """
    print('----------------------------- kaggle_data_transform -----------------------------')
    global Transform

    if Transform is None:
        if p_transform == 'standard':
            # Standardization, or mean removal and variance scaling
            Transform = StandardScaler()
        elif p_transform == 'scale':
            # Scaling features to a range
            Transform = MinMaxScaler()
        elif p_transform == 'abs_scale':
            # Scaling features to a range
            Transform = MaxAbsScaler()
        else:
            raise Exception('kaggle_data_transform: Wrong parameters', 'p_transform=%s' % p_transform)
        p = None
        if p_columns is None: # Apply the transformation to the whole dataset
            p = Transform.fit_transform(p_df)
            p = pd.DataFrame(data = p, columns = p_df.columns)
        else:
            p = p_df.copy()
            p[p_columns] = Transform.fit_transform(p_df[p_columns])
    else:
        p = None
        if p_columns is None: # Apply the transformation to the whole dataset
            p = Transform.transform(p_df)
            p = pd.DataFrame(data = p, columns = p_df.columns)
        else:
            p = p_df.copy()
            p[p_columns] = Transform.transform(p_df[p_columns]) # The transformer is already fitted: do not fit it again
    print('kaggle_data_transform: Dataset Head:')
    print(p.head())
    
    print('kaggle_data_transform: Done')
    return p
    # End of function kaggle_data_transform

The kaggle_features_selection() function provides:
- Mutual Information (MI), which measures the reduction in uncertainty about one variable when the value of the other variable is known. The larger the MI value, the stronger the relationship between the two variables. The MI scores help us during the features selection. 
They should be compared with the feature importances of the model shown by the kaggle_explore_ml() function.
- A correlation matrix. Features whose correlation with another feature exceeds the threshold are removed.
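
Here is a minimal sketch of both ideas on synthetic data. I assume scikit-learn's `mutual_info_regression` for the MI scores (the notebook uses its own show_mutual_information() helper) and a threshold of 0.7 on the absolute Pearson correlation. The column names and the generated data are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
features = pd.DataFrame({
    'a': a,
    'a_copy': a * 2.0 + rng.normal(scale=0.01, size=n),  # Almost perfectly correlated with 'a'
    'noise': rng.normal(size=n),                          # Unrelated to the target
})
target = 3.0 * a + rng.normal(scale=0.1, size=n)

# Mutual Information between each feature and the target
mi_scores = pd.Series(mutual_info_regression(features, target, random_state=0), index=features.columns)
print(mi_scores.sort_values(ascending=False))

# Keep only the upper triangle of the absolute correlation matrix,
# so that each pair of features is inspected once
corr = features.corr(method='pearson').abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.7).any()]
print('Columns to drop:', to_drop)
```

For each highly correlated pair, only the second column of the pair is dropped, so one representative of the pair stays in the dataset.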

In [None]:
# b) Features Selection
def kaggle_features_selection(p_df: pd.core.frame.DataFrame, p_correlation_threshold:float = 0.7) -> tuple:
    """
    Display the Mutual Information scores and drop the features that are highly correlated with another feature
    :parameter p_df: The dataset handle
    :parameter p_correlation_threshold: The correlation threshold above which a feature is dropped. Default: 0.7
    :return: The dataset after features selection and the list of the dropped columns
    """
    print('----------------------------- kaggle_features_selection -----------------------------')
    # Mutual Information
    print('kaggle_features_selection: Mutual information scores')
    show_mutual_information(p_df, p_title = 'Mutual Information Scores')
    # Build Correlation matrix
    print('----------------------------- kaggle_features_selection: Correlation table:')
    p = p_df.copy()
    p_corr = p.drop([TARGET_COLUMNS], axis = 1)
    # Extract correlation > 0.7 and < -0.7
    cor_matrix = p_corr.corr(method = 'pearson')
    print('----------------------------- kaggle_features_selection: cor_matrix:')
    print(cor_matrix)
    upper_tri = cor_matrix.where(np.triu(np.ones(cor_matrix.shape), k = 1).astype(bool)) # np.bool was removed from recent NumPy versions
    print('----------------------------- kaggle_features_selection: Correlations in range > %f and < -%f:' % (p_correlation_threshold, p_correlation_threshold))
    to_drop = [column for column in upper_tri.columns if any(upper_tri[column].abs() > p_correlation_threshold)] # abs() also catches strong negative correlations
    print('kaggle_features_selection: Drop ', to_drop)
    if len(to_drop) != 0:
        # Drop correlated columns
        p.drop(to_drop, axis = 1, inplace = True)
        print('----------------------------- kaggle_features_selection: new dataset:')
        print(p.head())
        print(p.describe().T)

    print('kaggle_features_selection: Done')
    return p, to_drop 
    # End of function kaggle_features_selection

Data augmentation is a technique to increase the size of the Training dataset by adding slightly modified copies of already existing data, using techniques such as flipping (left and/or right), zooming or adding noise. 
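
To make the idea concrete, here is a small NumPy-only sketch, independent of the TensorFlow pipeline used in this notebook. It assumes images stored as a (batch, height, width, channels) array with values in [0, 1]. The helper name `augment`, the noise level and the random images are assumptions made for the example.

```python
import numpy as np

def augment(images: np.ndarray, noise_std: float = 0.05, seed: int = 42) -> np.ndarray:
    """Hypothetical helper: return the original images plus a flipped copy and a noisy copy of each."""
    rng = np.random.default_rng(seed)
    flipped = images[:, :, ::-1]  # Left/right flip: reverse the width axis
    noisy = np.clip(images + rng.normal(scale=noise_std, size=images.shape), 0.0, 1.0)
    return np.concatenate([images, flipped, noisy], axis=0)

# 4 random "images" of 8x8 pixels with 3 channels, values in [0, 1)
images = np.random.default_rng(0).random((4, 8, 8, 3))
augmented = augment(images)
print(augmented.shape)  # (12, 8, 8, 3)
```

The labels are not modified by these transformations, so in practice each label is simply repeated for the copies of its image.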

In [None]:
def kaggle_image_augmentation(
                              p_dataset, 
                              p_strategy, 
                              p_augmentations: list = None, 
                              p_shuffle_buffer_size: int = 2048, 
                              p_seed: int = SEED_HARCODED_VALUE
                              ):
    """
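    Prepare a tf.data dataset for Deep Learning: optional augmentation, shuffling, batching and prefetching
    :parameter p_dataset: The dataset of (image, label) couples
    :parameter p_strategy: The distribution strategy, used to scale the batch size by the number of replicas
    :parameter p_augmentations: If not None, apply data augmentation (only the random left/right flip for the time being). Default: None
    :parameter p_shuffle_buffer_size: The size of the shuffle buffer. Default: 2048
    :parameter p_seed: The seed value for reproducibility
    :return: The batched dataset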
    """
    print('----------------------------- kaggle_image_augmentation -----------------------------')
    if not p_augmentations is None:
        dataset = p_dataset.map(kaggle_flip_left_right_data_augmentation, num_parallel_calls = tf.data.experimental.AUTOTUNE)
        dataset = dataset.repeat() # the training dataset must repeat for several epochs
        dataset = dataset.shuffle(p_shuffle_buffer_size, seed = p_seed) # Shuffle images of the batch at each iteration
        dataset = dataset.batch(DL_BATCH_SIZE * p_strategy.num_replicas_in_sync) # Prepare batches for DL
    else:
        dataset = p_dataset.batch(DL_BATCH_SIZE * p_strategy.num_replicas_in_sync) # Prepare batches for DL
        
    if p_augmentations is None:
        dataset = dataset.cache()
    dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE) # The next batch will always be ready for processing for the next iteration
    print('kaggle_image_augmentation: type(dataset) => ', type(dataset))
    print('kaggle_image_augmentation: Done')
    return dataset
    # End of function kaggle_image_augmentation  

def kaggle_flip_left_right_data_augmentation(p_image, p_label):
    """
    Note: Using method dataset.prefetch(tf.data.experimental.AUTOTUNE), 
          this processing happens essentially for free on TPU. 
          Data pipeline code is executed on the "CPU" part of the TPU 
          while the TPU itself is computing gradients.
    """
    print('----------------------------- kaggle_flip_left_right_data_augmentation -----------------------------')
    image = tf.image.random_flip_left_right(p_image)
    print('kaggle_flip_left_right_data_augmentation: Done')
    return image, p_label
    # End of function kaggle_flip_left_right_data_augmentation

TODO Add features selection comments

After cleaning and transforming the initial dataset, we can use it to train and validate our ML. So, the next step is to split our dataset into two parts:
- The features, the inputs of the ML/DL
- The target, the expected output of the ML/DL
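
As a quick illustration with pandas, here is how a tiny made-up dataset can be split. The column names and values are assumptions for the example. The features are also re-ordered by column name so that training and prediction always see the same column order.

```python
import pandas as pd

# Made-up dataset: two features and one target column
df = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 6.3],
    'petal_length': [1.4, 1.4, 6.0],
    'species': ['setosa', 'setosa', 'virginica'],
})
target_column = 'species'

y = df[[target_column]]               # The target, kept as a one-column DataFrame
x = df.drop(columns=[target_column])  # The features
x = x.reindex(sorted(x.columns), axis=1)  # Re-order the columns by name

print(x.columns.tolist())
print(y.columns.tolist())
```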


In [None]:
# 4. Evaluate Algorithms
# a) Split-out validation dataset
def kaggle_split_dataset(p_df: pd.core.frame.DataFrame, p_target: str = TARGET_COLUMNS) -> tuple:
    """
    Split the dataset into input features and target
    :parameter p_df: The dataset handle
    :parameter p_target: The output (target column) of the Machine Learning
    :return: The input features and the target (None if the target column is not in the dataset)
    """
    print('----------------------------- kaggle_split_dataset -----------------------------')
    y_values = None
    x_values = None
    if set([p_target]).issubset(set(p_df.columns)):
        y_values = p_df[[p_target]]
        x_values = p_df.drop([p_target], axis = 1)
    else:
        x_values = p_df
    
    # Re-order column by column name
    x_values = x_values.reindex(sorted(x_values.columns), axis = 1)
    
    print('----------------------------- kaggle_split_dataset: x_values:')
    print(x_values.head())
    if not y_values is None:
        print('----------------------------- kaggle_split_dataset: y_values:')
        print(y_values.head())

    print('kaggle_split_dataset: Done')
    return x_values, y_values
    # End of function kaggle_split_dataset

Now we can apply different models (linear, non-linear, ensemble...) to build our ML and evaluate their efficiency (4.b)
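
Before looking at the generic function, here is a minimal sketch of the idea on the Iris dataset bundled with scikit-learn. The choice of the three models, the 10 folds and the seed are assumptions for the example, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
models = [
    ('LR', LogisticRegression(max_iter=1000)),
    ('KNN', KNeighborsClassifier()),
    ('CART', DecisionTreeClassifier(random_state=42)),
]

# The same shuffled 10-fold split is used for every model, so the scores are comparable
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
results = {}
for name, model in models:
    scores = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
    results[name] = scores
    print('%s: %.3f (+/- %.3f)' % (name, scores.mean(), 2 * scores.std()))
```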

In [None]:
# b.1) Check models
def kaggle_check_models(
                        p_models: list, 
                        p_inputs_training_df: pd.core.frame.DataFrame, 
                        p_outputs_training_df: pd.core.frame.DataFrame, 
                        p_kparts: int = 10, 
                        p_random_state: int = SEED_HARCODED_VALUE, 
                        p_cross_validation: str = 'k-fold', 
                        p_scoring: str = 'accuracy') -> tuple:
    """
    Apply different models to train our Machine Learning and evaluate their efficiency using cross-validation
    :parameter p_models: A list of couples (name, model) to evaluate
    :parameter p_inputs_training_df: The training inputs dataset (training attributes)
    :parameter p_outputs_training_df: The training output dataset (training target)
    :parameter p_kparts: The number of folds
    :parameter p_random_state: The seed used to shuffle the data before splitting into folds
    :parameter p_cross_validation: 'k-fold' (KFold) or 's-k-fold' (StratifiedKFold). Default: 'k-fold'
    :parameter p_scoring: The scikit-learn scoring metric. Default: 'accuracy'
    :return: The list of cross-validation results and the list of model names
    """
    print('----------------------------- kaggle_check_models -----------------------------')
    results = []
    names = []
    for name, model in p_models:
        print('kaggle_check_models: Processing %s with type %s' % (name, type(model)))
        # Create train/test indices to split data in train/test sets
        if p_cross_validation == 'k-fold':
            kfold = model_selection.KFold(n_splits = p_kparts, random_state = p_random_state, shuffle = True) # K-fold Cross Validation
        elif p_cross_validation == 's-k-fold':
            kfold = model_selection.StratifiedKFold(n_splits = p_kparts, random_state = p_random_state, shuffle = True) # K-fold Cross Validation
        else:
            raise Exception('kaggle_check_models: Wrong parameters', 'p_cross_validation:%s' % p_cross_validation)
        cv_results = None
        # Evaluate model performance
        if p_cross_validation == 'k-fold' or p_cross_validation == 's-k-fold':
            cv_results = model_selection.cross_val_score(model, p_inputs_training_df, p_outputs_training_df, cv = kfold, scoring = p_scoring)
        else:
            cv_results = model_selection.cross_val_score(model, p_inputs_training_df, p_outputs_training_df, cv = LeaveOneOut(), scoring = p_scoring)
        #print('kaggle_check_models: cv_result=%s' % str(cv_results))
        results.append(cv_results)
        names.append(name)
        msg = 'kaggle_check_models: %s metric: %s: %f (+/-%f)' % (p_scoring, name, cv_results.mean(), 2 * cv_results.std())
        print(msg)
        # End of 'for' loop

    print('kaggle_check_models: Done')
    return results, names
    # End of function kaggle_check_models

In [None]:
# b.2) Check models for CNN
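def kaggle_check_dl_model(p_strategy, p_model, p_train_df, p_train_df_size, p_validation_df, p_validation_df_size):
    """
    Train the Deep Learning model, display the learning curves and the confusion matrix, and compute the scores on the validation dataset
    :parameter p_strategy: The distribution strategy
    :parameter p_model: The Keras model to train
    :parameter p_train_df: The training dataset
    :parameter p_train_df_size: The number of items in the training dataset (before data augmentation)
    :parameter p_validation_df: The validation dataset
    :parameter p_validation_df_size: The number of items in the validation dataset
    :return: The triple (F1 score, precision, recall), the training history and the model with the best weights
    """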
    print('----------------------------- kaggle_check_dl_model -----------------------------')
    with p_strategy.scope():
        print('kaggle_check_dl_model: Setup LearningRateScheduler:')
        
        lr_callback = tf.keras.callbacks.LearningRateScheduler(kaggle_build_learning_rate_schedule, verbose = True)
        kaggle_display_learning_rate_curve(kaggle_build_learning_rate_schedule)
        temp_name = next(tempfile._get_candidate_names()) + '.h5'
        # Training the model
        print('kaggle_check_dl_model: Train the model:')
        history = p_model.fit(
            p_train_df,
            validation_data = p_validation_df,
            epochs = DL_EPOCH_NUM,
            steps_per_epoch = p_train_df_size // (DL_BATCH_SIZE * p_strategy.num_replicas_in_sync), # p_train_df_size is the initial size of the Train dataset.
                                                                                                    # After data augmentation, this size is much much higher
            callbacks = [lr_callback, ModelCheckpoint(filepath = temp_name, monitor = 'val_loss', save_best_only = True)],
            verbose = 0
        )
        # End of 'with' statement

    # Display learning curves
    kaggle_display_learning_curves(
        history.history['loss'],
        history.history['val_loss'],
        'loss',
        211,
    )
    kaggle_display_learning_curves(
        history.history['sparse_categorical_accuracy'],
        history.history['val_sparse_categorical_accuracy'],
        'accuracy',
        212,
    )
    # Show scores
    # Split Validation images & labels
    print('kaggle_check_dl_model: Split images & labels for validation_df')
    validation_images_df = p_validation_df.map(lambda image, label: image)
    validation_labels_df = p_validation_df.map(lambda image, label: label).unbatch()
    # Convert labels into a Pandas dataframe
    #l = []
    #for i, e in validation_labels_df.enumerate():
    #    l[i] = int(e.numpy())
    #validation_labels_df = pd.DataFrame(l)
    #print('type(validation_labels_df) ==>', type(validation_labels_df))
    #print('Main: validation_labels_df.shape = ', len(validation_labels_df), ' / ', validation_df_size)
    #raise Exception('Stop')

    # Display confusion matrix
    cm_correct_labels = next(iter(validation_labels_df.batch(p_validation_df_size))).numpy() # np.ndarray
    print('kaggle_check_dl_model: cm_correct_labels ==>', cm_correct_labels)
    cm_probabilities = kaggle_prediction(p_model, validation_images_df)
    print('kaggle_check_dl_model: cm_probabilities ==>', cm_probabilities)
    y_predictions = np.argmax(cm_probabilities, axis = -1)
    print('kaggle_check_dl_model: y_predictions ==>', y_predictions)
    labels = range(len(IMAGE_CLASSES))
    cmat = confusion_matrix(
                            cm_correct_labels,
                            y_predictions,
                            labels = labels,
                            )
    cmat = (cmat.T / cmat.sum(axis=1)).T # normalize
    print('kaggle_check_dl_model: cmat = ', cmat)
    # Display scores
    score = f1_score(
        cm_correct_labels,
        y_predictions,
        labels=labels,
        average = 'macro',
    )
    print('kaggle_check_dl_model: Score = ', score)
    precision = precision_score(
                                cm_correct_labels,
                                y_predictions,
                                labels = labels,
                                average = 'macro',
    )
    print('kaggle_check_dl_model: Precision score = ', precision)
    recall = recall_score(
                          cm_correct_labels,
                          y_predictions,
                          labels = labels,
                          average='macro',
    )
    print('kaggle_check_dl_model: Recall score = ', recall)
    kaggle_display_confusion_matrix(cmat, score, precision, recall)

    # Retrieve the best weights for the model
    print('kaggle_check_dl_model: Retrieve the best weights for the model')
    model = tf.keras.models.load_model(temp_name)
    os.remove(temp_name)

    print('kaggle_check_dl_model: Done')
    return (score, precision, recall), history.history, model
    # End of function kaggle_check_dl_model

Then, we select the best model based on the scoring (4.c)
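
The selection itself is simple: compute the mean score of each model and keep the index of the highest one. Here is a minimal sketch, where the model names and the cross-validation scores are made-up values used only to show the mechanics.

```python
import numpy as np

names = ['LR', 'KNN', 'CART']
# Made-up cross-validation scores, one array per model
cv_scores = [
    np.array([0.90, 0.95, 0.92]),
    np.array([0.96, 0.97, 0.95]),
    np.array([0.88, 0.93, 0.91]),
]

means = np.array([scores.mean() for scores in cv_scores])
stds = np.array([scores.std() for scores in cv_scores])
best_idx = int(np.argmax(means))  # The first index is returned in case of a tie
print('Best model: %s: %f +/- %f' % (names[best_idx], means[best_idx], 2 * stds[best_idx]))
```

This only works as-is when a higher score is better. With error metrics that must be minimized, scikit-learn provides negated scorers such as 'neg_mean_squared_error', so that taking the maximum stays valid.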

In [None]:
def kaggle_compare_algorithms_perf(p_names: list, p_metrics: list, p_title: str, p_x_label: str, p_y_label:str) -> int:
    """
    Compare and return the model with the best results
    :parameter p_names: The list of models executed
    :parameter p_metrics: The scores obtained by each model
    :parameter p_title: Performance plot title
    :parameter p_x_label: Performance plot X-axis label
    :parameter p_y_label: Performance plot Y-axis label
    :return: The index of the model with the highest mean score
    """
    print('----------------------------- kaggle_compare_algorithms_perf -----------------------------')
    # Extract means & std
    means = []
    stds = []
    for i in range (len(p_names)):
        cv_results = p_metrics[i]
        means.append(cv_results.mean())
        stds.append(cv_results.std())
        # End of 'for' statement
    # Display means/standard deviation
    plt.title(p_title)
    plt.xlabel(p_x_label)
    plt.ylabel(p_y_label)
    plt.errorbar(p_names, means, stds, linestyle='None', marker='^')
    #plt.savefig('kaggle_algorithms_comparison.png')
    plt.show()
    # Select the best algorithm
    m = np.array(means)
    maxv = np.amax(m)
    idx = np.where(m == maxv)[0][0]
    print('kaggle_compare_algorithms_perf: Max value: %d:%f +/- %f ==> %s' % (idx, maxv, 2 * stds[idx], p_names[idx]))

    print('kaggle_compare_algorithms_perf: Done')
    return idx
    # End of function kaggle_compare_algorithms_perf

Finally, we use the scikit-learn GridSearchCV class to search for the best hyper-parameters of the selected model.
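
As a warm-up, here is a minimal grid search on the Iris dataset with a k-nearest neighbors classifier. The parameter grid, the 5 stratified folds and the 'accuracy' scoring are assumptions for the example. Every combination of the grid is evaluated by cross-validation and the best one is refitted on the whole training data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 3 x 2 = 6 combinations to evaluate
param_grid = {'n_neighbors': [4, 8, 16], 'weights': ['uniform', 'distance']}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid, cv=cv, scoring='accuracy')
search.fit(X, y)

print('Best score:', search.best_score_)
print('Best parameters:', search.best_params_)
best_model = search.best_estimator_  # Refitted on the whole dataset with the best parameters
```

Keep in mind that the number of fits grows quickly: it is the number of combinations multiplied by the number of folds (and by the number of repeats with a repeated k-fold).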

In [None]:
# 5. Improve Accuracy
# a) Algorithm Tuning
def kaggle_algorithm_tuning(
                            p_algorithm: list, 
                            p_inputs_training_df: pd.core.frame.DataFrame, 
                            p_outputs_training_df: pd.core.frame.DataFrame, 
                            p_validation_data: list = None,
                            p_strategy = None, 
                            p_kparts: int = 10, 
                            p_random_state: int = SEED_HARCODED_VALUE
                            ):
    """
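    Tune the model hyper-parameters
    :parameter p_algorithm: The couple (name, model) of the ML model to tune
    :parameter p_inputs_training_df: The input training data
    :parameter p_outputs_training_df: The target for the training data
    :parameter p_validation_data: The validation data (Keras models only)
    :parameter p_strategy: The distribution strategy (Keras models only)
    :parameter p_kparts: The number of folds
    :parameter p_random_state: The seed of the cross-validation
    :return: The tuned model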
    """
    print('----------------------------- kaggle_algorithm_tuning -----------------------------')
    model = p_algorithm[1]
    print('----------------------------- kaggle_algorithm_tuning: %s' % model.__class__.__name__)
    print('----------------------------- kaggle_algorithm_tuning: model summary:')
    print(model)
    print(model.get_params())

    # Fit the model
    if model.__class__.__name__ == 'LinearRegression':
        # LinearRegression has no hyper-parameters to tune: just fit the model
        print('kaggle_algorithm_tuning: No Hyper parameters tuning for LinearRegression')
        model.fit(p_inputs_training_df, p_outputs_training_df)
        return model
    elif not model.__class__.__name__.startswith('Keras'):
        # Hyper parameters tuning
        print('----------------------------- kaggle_algorithm_tuning: Hyper parameters tuning')
        if model.__class__.__name__.startswith('SV'):
            param_grid = {
                'C': [0.1, 1, 10, 100], 
                'gamma': [1, 0.1, 0.01, 0.001],
                'kernel': ['rbf', 'poly', 'sigmoid']
            }
        elif model.__class__.__name__ == 'LinearDiscriminantAnalysis':
            param_grid = {
                'shrinkage': [ 'auto', 0, 0.5, 1.0 ],
                'solver': [ 'svd', 'lsqr', 'eigen' ], 
                'tol': [1.0e-4, 1.0e-3, 1.0e-2, 1.0e-1]
            }
        elif model.__class__.__name__.startswith('LogisticRegression'):
            param_grid = {
                'C': [0.1, 1, 10, 100], 
                'penalty': ['l1','l2'],
                'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag'],
                'multi_class': ['ovr', 'multinomial'],
                'class_weight': [None, 'balanced'], # None (not the string 'None') means no class weighting
                'max_iter': [100, 150, 200], 
                'tol': [1.0e-4, 1.0e-3, 1.0e-2, 1.0e-1]
            }
        elif model.__class__.__name__.startswith('KNeighbors'):
            param_grid = {
                'n_neighbors': [4, 8, 16, 32], 
                'weights': ['uniform', 'distance'],
                'algorithm': ['ball_tree', 'kd_tree', 'brute']
            }
        elif model.__class__.__name__.startswith('LGBM'):
            param_grid = {
                'num_leaves': [32, 128],
                'reg_alpha': [0.1, 0.5],
                'min_data_in_leaf': [32, 64, 128, 256],
                'lambda_l1': [0, 1, 1.5],
                'lambda_l2': [0, 1],
                'learning_rate': [0.01, 0.05, 0.1, 0.2, 0.4, 0.6],
            }
        else:
            # Global grid parameters
            param_grid = {
                'n_estimators': [512, 1024, 2048],
                'max_depth': [16, 24 , 32],
                'max_leaf_nodes': [64, 128, 256],
                'learning_rate': [0.01, 0.05, 0.1, 0.2, 0.4, 0.6],
                #'loss': ['deviance'],
                #'min_samples_split': np.linspace(0.1, 0.5, 3),
                #'min_samples_leaf': np.linspace(0.1, 0.5, 3),
                #'max_features': ['log2', 'sqrt'],
                #'criterion': ['friedman_mse',  'mae'],
                #'subsample': np.linspace(0.5, 1.0, 3),
            }
            # Remove unsupported parameters
            if model.__class__.__name__.startswith('RandomForest'):
                del param_grid['learning_rate']

        cv = model_selection.RepeatedStratifiedKFold(n_splits = p_kparts, n_repeats = 10, random_state = p_random_state) #5 for hardcoded value
        tunning = GridSearchCV(
            estimator = model,
            param_grid = param_grid, 
            cv = cv, 
            n_jobs = 5, 
            scoring = 'neg_mean_squared_error',
            verbose = 2
        )
        model = tunning.fit(p_inputs_training_df, p_outputs_training_df)
    elif p_strategy is not None: # Keras model: a distribution strategy is required
        with p_strategy.scope():
            param_grid = {
                #'optimizer': [ 'rmsprop', 'adam' ],
                #'init': [ 'glorot_uniform' , 'normal' , 'uniform' ],
                'nb_epoch': [48, 64],
                'batch_size': [ 48 * p_strategy.num_replicas_in_sync, 64 * p_strategy.num_replicas_in_sync ]
            }
            cv = model_selection.RepeatedStratifiedKFold(n_splits = p_kparts, n_repeats = 10, random_state = p_random_state) #5 for hardcoded value
            tunning = GridSearchCV(estimator = model, param_grid = param_grid, cv = cv, n_jobs = 1, verbose = 2) # Cannot use n_jobs = 5 (or -1) because the Keras model cannot be pickled by scikit-learn
                                                                                                                 # See https://stackoverflow.com/questions/48717451/keras-kerasclassifier-gridsearch-typeerror-cant-pickle-thread-lock-objects
            early_stopping = EarlyStopping(
                min_delta = 0.001, # minimum amount of change to count as an improvement
                patience = 20, # how many epochs to wait before stopping
                restore_best_weights = True,
            )
            model = tunning.fit(p_inputs_training_df, 
                                p_outputs_training_df, 
                                validation_data = p_validation_data, 
                                epochs = DL_EPOCH_NUM, 
                                batch_size = DL_BATCH_SIZE * p_strategy.num_replicas_in_sync, 
                                callbacks = [early_stopping]
                               )
    else:
        raise Exception('kaggle_algorithm_tuning: Wrong parameters combination', 'DL with p_strategy set to None')

    print('----------------------------- kaggle_algorithm_tuning: Hyper parameters tuning ended:')
    print('kaggle_algorithm_tuning: Hyper parameters tuning: best_score_=', model.best_score_)
    print('kaggle_algorithm_tuning: Hyper parameters tuning: best_params_=', model.best_params_)
    print('kaggle_algorithm_tuning: Hyper parameters tuning: best_estimator_=', model.best_estimator_)
    for mean, std, params in zip(model.cv_results_['mean_test_score'], model.cv_results_['std_test_score'], model.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
    model = model.best_estimator_
    
    print('kaggle_algorithm_tuning: Done')
    return model
    # End of function kaggle_algorithm_tuning

We have now reached the point where we can evaluate our model with the Validation and/or Test datasets.
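
Here is a minimal sketch of such an evaluation for a classification problem, again on the Iris dataset. The 80/20 split, the k-nearest neighbors model and the two metrics are assumptions for the example. The model is trained on the training part only and scored on data it has never seen.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# Keep 20% of the data for validation, with the same class proportions in both parts
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

model = KNeighborsClassifier(n_neighbors=8).fit(X_train, y_train)
y_pred = model.predict(X_valid)

accuracy = accuracy_score(y_valid, y_pred)
cmat = confusion_matrix(y_valid, y_pred)  # Rows: expected classes, columns: predicted classes
print('Accuracy: %.2f %%' % (accuracy * 100))
print(cmat)
```

For a regression problem, the same pattern applies with regression metrics such as the mean squared error instead of the accuracy.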

In [None]:
# b) Ensembles
# 6. Finalize Model
# a) Predictions on validation dataset
def kaggle_validation_prediction(p_model, p_inputs, p_expected_outputs) -> np.ndarray:
    """
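    Executes prediction (p_inputs) and compares outputs against expected outputs (Validation) using the specified ML model
    :parameter p_model: The trained ML model
    :parameter p_inputs: The validation inputs
    :parameter p_expected_outputs: The expected outputs for the validation inputs
    :return: The predictions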
    """
    print('----------------------------- kaggle_validation_prediction -----------------------------')
    print('kaggle_validation_prediction: model=%s - is_class:%s - is_regr:%s' % (p_model.__class__.__name__, str(is_classifier(p_model)), str(is_regressor(p_model))))
    y_predictions = p_model.predict(p_inputs)
    print('kaggle_validation_prediction: Score: ', round(p_model.score(p_inputs, p_expected_outputs) * 100, 2), " %")
    if is_regressor(p_model) or p_model.__class__.__name__ == 'KerasRegressor': # Regression metrics (continuous target values)
        # Get scoring
        get_scoring(p_expected_outputs, y_predictions)
        # Analyze residual errors
        plt.scatter(p_expected_outputs, y_predictions)
        plt.show()
        # TODO Interpreting the Coefficients if possible
    elif is_classifier(p_model) or p_model.__class__.__name__ == 'KerasClassifier': # Classification metrics (class target values)
        # Get scoring
        get_scoring(p_expected_outputs, y_predictions)
    else:
        raise Exception('kaggle_validation_prediction: Invalid model')
    print('kaggle_validation_prediction: prediction is %s' % (str(y_predictions)))

    print('kaggle_validation_prediction: Done')
    return y_predictions
    # End of function kaggle_validation_prediction

def kaggle_prediction(p_model, p_inputs) -> np.ndarray:
    """
    Executes prediction (p_inputs) using the specified ML model
    :parameter p_model: The trained ML model
    :parameter p_inputs: The inputs to predict
    :return: The predictions
    """
    print('----------------------------- kaggle_prediction -----------------------------')
    y_prediction = p_model.predict(p_inputs)
    print('kaggle_prediction: prediction is %s' %(str(y_prediction)))
    print('kaggle_prediction: Done')
    return y_prediction
# b) Create standalone model on entire training dataset
# TODO

The functions below are some helpers to save or reload the model and features engineering data.
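
The principle is a simple pickle round trip, sketched below with a scikit-learn model. The file name 'MyModel.pkl' and the temporary directory are assumptions for the example. After reloading, the model must give the same predictions as the original one.

```python
import os
import pickle
import tempfile

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'MyModel.pkl')
    # Save the trained model; the 'with' statement closes the file even on error
    with open(path, 'wb') as fd:
        pickle.dump(model, fd)
    # Reload it, as a later session would do
    with open(path, 'rb') as fd:
        reloaded = pickle.load(fd)

same_predictions = bool((reloaded.predict(X) == model.predict(X)).all())
print('Same predictions after reload:', same_predictions)
```

The encoders, imputers and transformers fitted during the features engineering have to be saved the same way, otherwise new data cannot be prepared identically before prediction. Also, a pickle file should only be reloaded if it comes from a trusted source.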

In [None]:
# Pickling Keras Models to prevent error 'TypeError: can't pickle weakref objects' while saving the model
# http://zachmoshe.com/2017/04/03/pickling-keras-models.html
def make_keras_picklable():
    """
    Makes Keras models picklable, to prevent the error 'TypeError: can't pickle weakref objects' while saving the model
    See http://zachmoshe.com/2017/04/03/pickling-keras-models.html
    """
    def __getstate__(self):
        model_str = ""
        with tempfile.NamedTemporaryFile(suffix='.hdf5', delete=True) as fd:
            keras.models.save_model(self, fd.name, overwrite=True)
            model_str = fd.read()
        d = { 'model_str': model_str }
        return d

    def __setstate__(self, state):
        with tempfile.NamedTemporaryFile(suffix='.hdf5', delete=True) as fd:
            fd.write(state['model_str'])
            fd.flush()
            model = keras.models.load_model(fd.name)
        self.__dict__ = model.__dict__


    cls = keras.models.Model
    cls.__getstate__ = __getstate__
    cls.__setstate__ = __setstate__
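
The trick above relies on the `__getstate__`/`__setstate__` hooks of the pickle protocol. Here is a minimal sketch of the same idea without Keras. The class `ToyModel` and its attributes are hypothetical and only illustrate the mechanism: the unpicklable member is left out when saving and rebuilt when loading.

```python
import pickle
import threading


class ToyModel:
    """Hypothetical stand-in for a model holding a member that cannot be pickled."""

    def __init__(self, weights):
        self.weights = list(weights)
        self._lock = threading.Lock()  # Lock objects cannot be pickled

    def __getstate__(self):
        # Called by pickle.dump(): keep only the picklable part of the state
        return {'weights': self.weights}

    def __setstate__(self, state):
        # Called by pickle.load(): restore the saved part, rebuild the rest
        self.weights = state['weights']
        self._lock = threading.Lock()


model = ToyModel([0.1, 0.2, 0.3])
restored = pickle.loads(pickle.dumps(model))
print(restored.weights)
```

Without the two hooks, `pickle.dumps(model)` would fail on the lock, in the same way that pickling a Keras model fails on its weak references.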

In [None]:
# c) Save model for later use
def kaggle_save_model(p_model, p_paths: str, p_file_name:str) -> None:
    """
    Save the provided model in binary/pickle format, together with the Encoders/Imputers
    :parameter p_model: The ML model to save
    :parameter p_paths: The path to save the files, ending with a '/' (e.g. ./)
    :parameter p_file_name: The file name without extension (e.g. 'MyModel')
    """
    global Encoders, Encoder_Instance, Imputer_Instance, Transform

    print('----------------------------- kaggle_save_model -----------------------------')
    make_keras_picklable()
    # Serialize the model (the 'with' statement closes the file once written)
    with open(p_paths + p_file_name + '.pkl', 'wb') as f:
        pickle.dump(p_model, f)
    print('kaggle_save_model: Done: %s' % (p_file_name + '.pkl'))
    # Save Encoders, Encoder_Instance, Imputer_Instance and Transform
    if Encoders is not None and len(Encoders) != 0:
        with open(p_paths + 'Encoders.pkl', 'wb') as f:
            pickle.dump(Encoders, f)
    if Encoder_Instance is not None:
        with open(p_paths + 'Encoder_Instance.pkl', 'wb') as f:
            pickle.dump(Encoder_Instance, f)
    if Imputer_Instance is not None:
        with open(p_paths + 'Imputer_Instance.pkl', 'wb') as f:
            pickle.dump(Imputer_Instance, f)
    if Transform is not None:
        with open(p_paths + 'Transform.pkl', 'wb') as f:
            pickle.dump(Transform, f)

    print('kaggle_save_model: Done')
    # End of function kaggle_save_model

In [None]:
# Load model
def kaggle_load_model(p_paths: str, p_file_name:str):
    """
    Load a model in binary/pickle format, together with the Encoders/Imputers
    :parameter p_paths: The path to load the files from, ending with a '/' (e.g. ./)
    :parameter p_file_name: The file name without extension (e.g. 'MyModel')
    :return: The reloaded ML model
    """
    global Encoders, Encoder_Instance, Imputer_Instance, Transform

    print('----------------------------- kaggle_load_model -----------------------------')
    # Reload the feature engineering data, falling back to defaults when a file is missing or unreadable
    try:
        Encoders = pickle.load(open(p_paths + 'Encoders.pkl', 'rb'))
    except Exception:
        Encoders = dict()
    try:
        Encoder_Instance = pickle.load(open(p_paths + 'Encoder_Instance.pkl', 'rb'))
    except Exception:
        Encoder_Instance = LabelEncoder()
    try:
        Imputer_Instance = pickle.load(open(p_paths + 'Imputer_Instance.pkl', 'rb'))
    except Exception:
        Imputer_Instance = None
    try:
        Transform = pickle.load(open(p_paths + 'Transform.pkl', 'rb'))
    except Exception:
        Transform = None
    # The model itself is mandatory: let the error propagate if it cannot be loaded
    model = pickle.load(open(p_paths + p_file_name + '.pkl', 'rb'))
    
    print('kaggle_load_model: Done')
    return model
    # End of function kaggle_load_model
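
To check the save/reload logic without training anything, here is a small round-trip sketch. It uses plain dictionaries as placeholders for the model and the encoders, and the helper names `save_artifacts` and `load_artifact` are hypothetical (they are not part of this notebook). It follows the same conventions as above: one `.pkl` file per object, nothing saved for `None`, and a default value when a file is missing.

```python
import os
import pickle
import tempfile


def save_artifacts(p_path: str, p_artifacts: dict) -> None:
    """Pickle each artifact that is not None into '<p_path>/<name>.pkl'."""
    for name, artifact in p_artifacts.items():
        if artifact is None:
            continue  # Nothing to save for this artifact
        with open(os.path.join(p_path, name + '.pkl'), 'wb') as f:
            pickle.dump(artifact, f)


def load_artifact(p_path: str, p_name: str, p_default=None):
    """Reload one artifact, falling back to p_default when the file is missing."""
    try:
        with open(os.path.join(p_path, p_name + '.pkl'), 'rb') as f:
            return pickle.load(f)
    except FileNotFoundError:
        return p_default


with tempfile.TemporaryDirectory() as tmp_dir:
    artifacts = {
        'MyModel': {'coef': [0.5, -1.2], 'intercept': 0.1},  # Placeholder for a trained model
        'Encoders': {'Sex': ['female', 'male']},             # Placeholder for fitted encoders
        'Imputer_Instance': None,                            # Not used, so not saved
    }
    save_artifacts(tmp_dir, artifacts)
    model = load_artifact(tmp_dir, 'MyModel')
    encoders = load_artifact(tmp_dir, 'Encoders', p_default={})
    imputer = load_artifact(tmp_dir, 'Imputer_Instance')
    saved_files = sorted(os.listdir(tmp_dir))
    print(saved_files)
```

Catching only `FileNotFoundError` is a design choice of this sketch: a corrupted file raises an error instead of being silently replaced by the default value.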

The function kaggle_explore_ml() provides some insight into our ML model by computing the permutation importance of each feature.

# PIMA diabetes
For the Pima Indians diabetes dataset, if we compare the Mutual Information scores with the most important features, the body mass index feature seems to be the main key to diabetes.


In [None]:
def kaggle_explore_ml(p_model, p_x_validation: pd.core.frame.DataFrame, p_y_validation: pd.core.frame.DataFrame, p_random_state:int = SEED_HARCODED_VALUE) -> None:
    """
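    Apply the permutation feature importance concept to our ML
    :parameter p_model: The trained ML model to explore
    :parameter p_x_validation: The validation input features
    :parameter p_y_validation: The expected outputs for the validation inputs
    :parameter p_random_state: The seed used for the permutations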
    """
    print('----------------------------- kaggle_explore_ml -----------------------------')
    result = permutation_importance(p_model, p_x_validation, p_y_validation, n_repeats = 32, random_state = p_random_state)
    sorted_idx = result.importances_mean.argsort()
    print('----------------------------- kaggle_explore_ml: result:')
    print(sorted_idx)

    fig, ax = plt.subplots()
    ax.boxplot(result.importances[sorted_idx].T, vert = False, labels = p_x_validation.columns[sorted_idx])
    ax.set_title("Permutation Importances (Validation set)")
    fig.tight_layout()
    plt.show()
    print('kaggle_explore_ml: Done')
    # End of function kaggle_explore_ml
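
To make the idea behind permutation importance more concrete, here is a from-scratch sketch in pure Python. It is not the scikit-learn implementation, only an illustration of the principle: shuffle one feature column, measure how much the score drops, and average the drop over several repeats. The names `accuracy`, `permutation_importances` and `toy_predict` are hypothetical, and the toy "model" only looks at the first feature.

```python
import random


def accuracy(p_predict, p_rows, p_targets):
    """Fraction of rows for which the prediction matches the target."""
    hits = sum(1 for row, target in zip(p_rows, p_targets) if p_predict(row) == target)
    return hits / len(p_rows)


def permutation_importances(p_predict, p_rows, p_targets, p_repeats=8, p_seed=42):
    """Mean accuracy drop per feature when the column of that feature is shuffled."""
    rng = random.Random(p_seed)
    baseline = accuracy(p_predict, p_rows, p_targets)
    importances = []
    for col in range(len(p_rows[0])):
        drops = []
        for _ in range(p_repeats):
            # Shuffle one column only, keep the other columns untouched
            shuffled = [row[col] for row in p_rows]
            rng.shuffle(shuffled)
            permuted = [row[:col] + [value] + row[col + 1:] for row, value in zip(p_rows, shuffled)]
            drops.append(baseline - accuracy(p_predict, permuted, p_targets))
        importances.append(sum(drops) / p_repeats)
    return importances


# Toy dataset: the target depends on the first feature only, the second one is noise
data_rng = random.Random(0)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
targets = [int(row[0] > 0.5) for row in rows]


def toy_predict(row):
    """Toy 'model' which uses only the first feature."""
    return int(row[0] > 0.5)


importances = permutation_importances(toy_predict, rows, targets)
print(importances)
```

The first feature gets a large importance because shuffling it breaks the predictions, while the second one gets exactly 0 because the toy model never reads it.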

The function kaggle_save_result() saves the prediction results in Kaggle format for submission to a Kaggle competition. The function kaggle_save_result_dl() does the same for image datasets.

In [None]:
def kaggle_save_result(p_model, p_column:str, p_predictions: np.ndarray, p_test_file_name: str, p_file_name:str) -> None:
    """
    Save prediction results in Kaggle format for submission to the competition
    :parameter p_model: The trained ML model (currently unused)
    :parameter p_column: The index column name
    :parameter p_predictions: The prediction results based on the Test dataset
    :parameter p_test_file_name: The path of the original test dataset file, used to extract the index for Kaggle submission (see p_column)
    :parameter p_file_name: The full path of the submission file (e.g. './MyResults.csv')
    """
    print('----------------------------- kaggle_save_result -----------------------------')
    # Reload the index column (e.g. the PassengerId list for Titanic)
    validation_df = pd.read_csv(p_test_file_name)
    p = validation_df[[p_column]].astype(int)
    # FIXME How to proceed with multiple outputs?
    # Write the submission file
    print(p.shape)
    print(p_predictions.shape)
    print(len(p_predictions.squeeze()))
    pred = pd.DataFrame({p_column: list(p.squeeze()), TARGET_COLUMNS: p_predictions.astype(int).squeeze()})
    pred.to_csv(p_file_name, index = False)
    print('kaggle_save_result: Done: ', p_file_name)
    # End of function kaggle_save_result

def kaggle_save_result_dl(p_model, p_columns:list, p_predictions: np.ndarray, p_test_df, test_df_size: int, p_file_name:str) -> None:
    """
    Save prediction results in Kaggle format for submission to the competition, for image processing
    :parameter p_model: The trained ML model (currently unused)
    :parameter p_columns: The list of the column names (index column first, then label column)
    :parameter p_predictions: The prediction results based on the Test dataset
    :parameter p_test_df: The test dataset containing unlabelled images
    :parameter test_df_size: The size of the test dataset (# of images)
    :parameter p_file_name: The full path of the submission file (e.g. './MyResults.csv')
    """
    print('----------------------------- kaggle_save_result_dl -----------------------------')
    print('p_predictions ==>', p_predictions)
    print(p_predictions.shape)
    print(len(p_predictions.squeeze()))
    # Get image ids from test set and convert to unicode
    test_ids_ds = p_test_df.map(lambda image, idnum: idnum).unbatch()
    test_ids = next(iter(test_ids_ds.batch(test_df_size))).numpy().astype('U')
    print('type(test_ids) ==>', type(test_ids))
    print('test_ids ==>', test_ids)
    # Write the submission file
    pred = pd.DataFrame({p_columns[0]: list(test_ids.squeeze()), p_columns[1]: p_predictions.astype(int).squeeze()})
    pred.to_csv(p_file_name, index = False)
    print('kaggle_save_result_dl: Done: ', p_file_name)
    # End of function kaggle_save_result_dl
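
As an illustration of the submission format, here is a minimal sketch using only the standard `csv` module instead of pandas. The function name `build_submission` and the sample ids and predictions are hypothetical. It writes one header line with the index and target column names, then one line per prediction, and checks first that both lists have the same length.

```python
import csv
import io


def build_submission(p_ids, p_predictions, p_id_column, p_target_column):
    """Return the content of a Kaggle-style submission file as a string."""
    if len(p_ids) != len(p_predictions):
        raise ValueError('ids and predictions shall have the same length')
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow([p_id_column, p_target_column])  # Header line
    for row_id, prediction in zip(p_ids, p_predictions):
        # Classification results are stored as integers, as in the functions above
        writer.writerow([int(row_id), int(prediction)])
    return buffer.getvalue()


submission = build_submission([892, 893, 894], [0.0, 1.0, 0.0], 'PassengerId', 'Survived')
print(submission)
```

Checking the lengths up front gives a clearer error message than a mismatch discovered while building the table.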

Finally, here is the entry point function and the main call:

In [None]:
def kaggle_main() -> None:
    global DL_INPUT_SHAPE
    
    # Set defaults
    kaggle_set_seed()
    kaggle_set_mp_default()
    
    # Current path
    print(os.path.abspath(os.getcwd()))
    # Kaggle current path and files
    #for dirname, _, filenames in os.walk('/kaggle/input'):
    #    for filename in filenames:
    #        print(os.path.join(dirname, filename))

    # Modules version
    kaggle_modules_versions()

    flags = FLAGS
    # Parse arguments. Used only if this notebook code is used as a standalone Python script
    #parser = argparse.ArgumentParser()
    #parser.add_argument('--summarize', help = 'Process statistical analysis', action='store_true')
    #parser.add_argument('--summarize-only', help = 'Process only statistical analysis', action='store_true')
    #parser.add_argument('--visualize', help = 'Generate different plots based on statistical analysis', action='store_true')
    #parser.add_argument('--no-data-cleaning', help = 'Do not apply Data Cleaning', action='store_true')
    #parser.add_argument('--neural-network', help = 'Use neural network as ML', action='store_true')
    #args = parser.parse_args()
    #if args.summarize or args.summarize_only:
    #    flags |= ExecutionFlags.DATA_STAT_SUMMURIZE_FLAG
    #if args.visualize:
    #    flags |= ExecutionFlags.DATA_VISUALIZATION_FLAG
    #if args.no_data_cleaning:
    #    flags &= ~ExecutionFlags.DATA_CLEANING_FLAG
    #if args.neural_network:
    #    flags |= ExecutionFlags.USE_NEURAL_NETWORK_FLAG
    
    # TODO Uncomment if using Pima Indians diabetes dataset
    #flags &= ~ExecutionFlags.DATA_CLEANING_FLAG
    print('Main: Generic template approach to "play" with the Machine Learning concepts: flags=%s' % str(flags))
    print('Main: Playground name: ', ML_NAME)

    kaggle_pre_main()

    strategy, global_path = kaggle_tpu_detection()
    if global_path is None: # Force global_path to local path when datasets not copied to gs://kds-xxx
        global_path = os.path.join(os.path.abspath(os.getcwd()), '../input')
        global_path = os.path.join(global_path, DATABASE_NAME)

    if not flags & ExecutionFlags.USE_CNN_NEURAL_NETWORK_FLAG == ExecutionFlags.USE_CNN_NEURAL_NETWORK_FLAG:
        y_test_df = None
        if ML_NAME == 'Pima':
            train_df, validation_df, test_df, y_test_df = kaggle_load_datasets(
                                                                               p_url = DATABASE_URI,
                                                                               p_labels = COLUMNS_LABEL
                                                                               )
        elif ML_NAME == 'Iris':
            train_df, validation_df, test_df, y_test_df = kaggle_load_datasets(
                                                                               p_url = DATABASE_URI,
                                                                               p_labels = COLUMNS_LABEL
                                                                               )
        elif ML_NAME == 'Titanic':
            train_df, validation_df, test_df, _ = kaggle_load_datasets(
                                                                       p_url = None, 
                                                                       p_train_path = '../input/titanic/train.csv', 
                                                                       p_validation_path = '../input/titanic/test.csv'
                                                                       )
        else:
            raise Exception('A dataset shall be selected')
        print('Main: training dataset')
        print(train_df.head())
        print('Main: validation dataset')
        print(validation_df.head())
        print('Main: test dataset')
        print(test_df.head())
        print('Main: test outputs dataset is %s ' % ('empty' if y_test_df is None else 'not null'))
        # Do a basic ML evaluation as reference for later result comparisons
        if not flags & ExecutionFlags.USE_ONLY_NEURAL_NETWORK_FLAG == ExecutionFlags.USE_ONLY_NEURAL_NETWORK_FLAG:
            kaggle_ml_quick_and_dirty(train_df, validation_df, test_df, p_test_outputs_df = y_test_df)
        else:
            kaggle_dl_quick_and_dirty(train_df, validation_df, strategy, test_df, p_test_outputs_df = y_test_df)
    else:
        # We are loading Images datasets
        if ML_NAME == 'FlowerClassification':
            # Images database already in TFRecord format
            image_folder = 'tfrecords-jpeg-' + str(IMAGE_NUM_PIXELS) + 'x' + str(IMAGE_NUM_PIXELS)
            train_df, validation_df, test_df, train_df_size, validation_df_size, test_df_size = kaggle_load_images_datasets(p_train_url = None,
                                                                                                                            p_train_path = image_folder + '/train/*.tfrec',
                                                                                                                            p_validation_path = image_folder + '/val/*.tfrec',
                                                                                                                            p_global_path = global_path,
                                                                                                                            p_test_path = image_folder + '/test/*.tfrec'
                                                                                                                            )
        elif ML_NAME == 'BMS-MolecularTranslation':
            # Images and labels are separated
            train_df, validation_df, test_df, train_df_size, validation_df_size, test_df_size = kaggle_load_images_datasets(p_train_url = DATABASE_NAME,
                                                                                                                            p_labels = COLUMNS_LABEL, 
                                                                                                                            p_global_path = global_path, 
                                                                                                                            p_test_url = DATABASE_NAME
                                                                                                                            )
        elif ML_NAME == 'HumanProteinAtlas':
            # Images and labels are separated
            train_df, validation_df, test_df, train_df_size, validation_df_size, test_df_size = kaggle_load_images_datasets(p_train_url = DATABASE_NAME,
                                                                                                                            p_labels = COLUMNS_LABEL, 
                                                                                                                            p_global_path = global_path,
                                                                                                                            p_test_url = DATABASE_NAME
                                                                                                                            )
        else:
            raise Exception('A dataset shall be selected')
        print('type(train_df) ==> ', type(train_df))
        print('Main: train_df_size', train_df_size)
        print('type(validation_df) ==> ', type(validation_df))
        print('Main: validation_df_size', validation_df_size)
        print('type(test_df) ==> ', type(test_df))
        print('Main: test_df_size', test_df_size)
    # End of datasets loading operation

#    raise Exception('Stop')

    # Data visualization
    if not flags & ExecutionFlags.USE_CNN_NEURAL_NETWORK_FLAG == ExecutionFlags.USE_CNN_NEURAL_NETWORK_FLAG:
        numerical_columns = None
        categorical_columns = None
        if flags & ExecutionFlags.DATA_STAT_SUMMURIZE_FLAG == ExecutionFlags.DATA_STAT_SUMMURIZE_FLAG:
            print('Main: Summarize data for training dataset')
            kaggle_summurize_data(train_df)
            print('Main: Summarize data for validation dataset')
            kaggle_summurize_data(validation_df)
            print('Main: Summarize data for test dataset')
            kaggle_summurize_data(test_df)
        #    if args.summarize_only:
        #        return

        if flags & ExecutionFlags.DATA_VISUALIZATION_FLAG == ExecutionFlags.DATA_VISUALIZATION_FLAG:
            print('Main: Visualisation for train dataset')
            kaggle_visualization(train_df)
            cross_dataset_visualization([train_df, validation_df])
    else:
        # 1. Data augmentation and batches setup
        train_df = kaggle_image_augmentation(train_df, strategy, ['flip_right_left'])
        validation_df = kaggle_image_augmentation(validation_df, strategy) # No data augmentation for Validation dataset
        copy_validation_df = validation_df # Keep a copy of the original validation_df for model evaluation
        test_df = test_df.batch(DL_BATCH_SIZE * strategy.num_replicas_in_sync) # Prepare batches for DL
        test_df = test_df.prefetch(tf.data.experimental.AUTOTUNE) # The next batch will always be ready for processing for the next iteration
        print('Main: Dataset: %d training images, %d validation images, %d unlabeled test images' % (train_df_size, validation_df_size, test_df_size))
        # 2. Data visualization
        if flags & ExecutionFlags.DATA_VISUALIZATION_FLAG == ExecutionFlags.DATA_VISUALIZATION_FLAG:
            print("Main: Training images:", train_df)
            kaggle_image_visualization(train_df)
            print ("Main: Validation images:")
            kaggle_image_visualization(validation_df)
            print("Main: Test images:")
            kaggle_image_visualization(test_df)
        else:
            raise Exception('Main', 'To be continued')
    # End of Data visualization

#    raise Exception('Stop')

    # Data engineering
    if not flags & ExecutionFlags.USE_CNN_NEURAL_NETWORK_FLAG == ExecutionFlags.USE_CNN_NEURAL_NETWORK_FLAG:
        categorical_features = None
        numerical_features = None
        if flags & ExecutionFlags.DATA_CLEANING_FLAG == ExecutionFlags.DATA_CLEANING_FLAG:
            train_df, new_categorical_features, numerical_features = kaggle_features_engineering(train_df, p_date_time_columns = DATE_TIME_COLUMNS, p_missing_value_method = 'mean')
            print(train_df.columns)
            validation_df, _, _ = kaggle_features_engineering(validation_df, p_date_time_columns = DATE_TIME_COLUMNS, p_missing_value_method = 'mean')
            print(validation_df.columns)
            if len(train_df.columns) != len(validation_df.columns):
                l = list(set(train_df.columns) - set(validation_df.columns))
                print('Main: Features unaligned for validation_df: ', l)
                validation_df[l] = 0
                print(validation_df.columns)
            test_df, _, _ = kaggle_features_engineering(test_df, p_date_time_columns = DATE_TIME_COLUMNS, p_missing_value_method = 'mean')
            print(test_df.columns)
            if len(train_df.columns) != len(test_df.columns):
                l = list(set(train_df.columns) - set(test_df.columns) - set([TARGET_COLUMNS]))
                print('Main: Features unaligned for test_df: ', l)
                test_df[l] = 0
                print(test_df.columns)
            print('Main: training dataset after data engineering')
            print(train_df.head())
            print('Main: validation dataset after data engineering')
            print(validation_df.head())
            print('Main: test dataset after data engineering')
            print(test_df.head())
            # Visualization after data engineering
            if flags & ExecutionFlags.DATA_STAT_SUMMURIZE_FLAG == ExecutionFlags.DATA_STAT_SUMMURIZE_FLAG:
                print('Main: Summarize data for training dataset')
                kaggle_summurize_data(train_df)
                print('Main: Summarize data for validation dataset')
                kaggle_summurize_data(validation_df)
                print('Main: Summarize data for test dataset')
                kaggle_summurize_data(test_df)
            # Do a basic ML evaluation as reference for the end
            if not flags & ExecutionFlags.USE_ONLY_NEURAL_NETWORK_FLAG == ExecutionFlags.USE_ONLY_NEURAL_NETWORK_FLAG:
                kaggle_ml_quick_and_dirty(train_df, validation_df, test_df, p_test_outputs_df = y_test_df)
            else:
                kaggle_dl_quick_and_dirty(train_df, validation_df, strategy, test_df, p_test_outputs_df = y_test_df)

        if flags & ExecutionFlags.DATA_TRANSFORM_FLAG == ExecutionFlags.DATA_TRANSFORM_FLAG:
            # Select the numerical (non-categorical) columns to transform
            if not numerical_features is None:
                columns_to_transform = numerical_features
                if not NON_TRANSFORMABLE_COLUMNS is None:
                    columns_to_transform = list(set(columns_to_transform) - set(NON_TRANSFORMABLE_COLUMNS))
                if not OUTPUT_IS_REGRESSION:
                    # Remove TARGET_COLUMNS from columns_to_transform list because it is a classification (yes, no)
                    columns_to_transform = list(set(columns_to_transform) - set([TARGET_COLUMNS]))
                print('Main: columns_to_transform: %s' % str(columns_to_transform))
                print('Main: training dataset statistics before features transformation: ')
                print(train_df.describe())
                train_df = kaggle_data_transform(train_df, columns_to_transform, p_transform = 'standard')
                validation_df = kaggle_data_transform(validation_df, columns_to_transform, p_transform = 'standard')
                if OUTPUT_IS_REGRESSION:
                    # Remove TARGET_COLUMNS from columns_to_transform list because test_df does not contain targets
                    columns_to_transform = list(set(columns_to_transform) - set([TARGET_COLUMNS]))
                test_df = kaggle_data_transform(test_df, columns_to_transform, p_transform = 'standard')
                print('Main: training dataset after features transformation')
                print(train_df.head())
                print('Main: validation dataset after features transformation')
                print(validation_df.head())
                print('Main: test dataset after features transformation')
                print(test_df.head())
                # Do a basic DL evaluation as reference for the end
                if not flags & ExecutionFlags.USE_ONLY_NEURAL_NETWORK_FLAG == ExecutionFlags.USE_ONLY_NEURAL_NETWORK_FLAG:
                    kaggle_ml_quick_and_dirty(train_df, validation_df, test_df, p_test_outputs_df = y_test_df)
                else:
                    kaggle_dl_quick_and_dirty(train_df, validation_df, strategy, test_df, p_test_outputs_df = y_test_df)

        train_df, dropped_features = kaggle_features_selection(train_df)
        #dropped_features = []
        if len(dropped_features) != 0:
            if set(dropped_features).issubset(set(validation_df.columns)):
                validation_df.drop(dropped_features, inplace = True, axis = 1)
            if set(dropped_features).issubset(set(test_df.columns)):
                test_df.drop(dropped_features, inplace = True, axis = 1)
            print('Main: training dataset after features selection')
            print(train_df.head())
            print('Main: validation dataset after features selection')
            print(validation_df.head())
            print('Main: test dataset after features selection')
            print(test_df.head())
    else:
        # Nothing to do
        pass
    # End of Data engineering    

    # Building models and training operation
    models = []
    if not flags & ExecutionFlags.USE_CNN_NEURAL_NETWORK_FLAG == ExecutionFlags.USE_CNN_NEURAL_NETWORK_FLAG:
        ml_inputs_training_df, ml_outputs_training_df = kaggle_split_dataset(train_df)
        ml_inputs_validation_df, ml_outputs_validation_df = kaggle_split_dataset(validation_df)
        ml_inputs_test_df, _ = kaggle_split_dataset(test_df)
        DL_INPUT_SHAPE = [ml_inputs_training_df.shape[1]]
        print('Main: ml_inputs_training_df')
        print(ml_inputs_training_df.head())
        print('Main: ml_outputs_training_df')
        print(ml_outputs_training_df.head())
        print('Main: ml_inputs_validation_df')
        print(ml_inputs_validation_df.head())
        print('Main: ml_outputs_validation_df')
        print(ml_outputs_validation_df.head())
        print('Main: ml_inputs_test_df')
        print(ml_inputs_test_df.head())

        # Checking models
        scoring = None
        if OUTPUT_IS_REGRESSION: # Use regression algorithms
            scoring = 'r2' # 'r2' or 'neg_mean_absolute_error'
            if not flags & ExecutionFlags.USE_ONLY_NEURAL_NETWORK_FLAG == ExecutionFlags.USE_ONLY_NEURAL_NETWORK_FLAG:
                models.append(('LR', linear_model.LinearRegression()))
                models.append(('SGDC', linear_model.SGDRegressor(random_state = SEED_HARCODED_VALUE)))
                models.append(('LASSO', linear_model.Lasso()))
                models.append(('EN', linear_model.ElasticNet()))
                models.append(('KNN', neighbors.KNeighborsRegressor(n_neighbors = 8)))
                models.append(('CART', tree.DecisionTreeRegressor(max_leaf_nodes = 256, random_state = SEED_HARCODED_VALUE)))
                models.append(('LGBMR', lgb.LGBMRegressor(n_estimators = 1024, num_leaves = 128, max_depth = 32, learning_rate=0.05, random_state = SEED_HARCODED_VALUE)))
                models.append(('XGB', xgb.XGBRegressor(n_estimators = 1024, learning_rate=0.05, random_state = SEED_HARCODED_VALUE)))
                models.append(('BGK', ensemble.GradientBoostingRegressor(n_estimators = 1024, max_depth = 32, random_state = SEED_HARCODED_VALUE)))
                models.append(('RF', ensemble.RandomForestRegressor(n_estimators = 1024, max_depth = 32, max_features = 4, random_state = SEED_HARCODED_VALUE)))
                models.append(('SVR', svm.SVR()))
            if flags & ExecutionFlags.USE_NEURAL_NETWORK_FLAG == ExecutionFlags.USE_NEURAL_NETWORK_FLAG or flags & ExecutionFlags.USE_ONLY_NEURAL_NETWORK_FLAG == ExecutionFlags.USE_ONLY_NEURAL_NETWORK_FLAG:
                models.append(('NNR', KerasRegressor(build_fn = kaggle_create_sequential_regressor_model, nb_epoch = DL_EPOCH_NUM, batch_size = DL_BATCH_SIZE * strategy.num_replicas_in_sync, verbose = 1)))
            # TODO Add support of regressors!
        else: # Use classifier algorithms
            scoring = 'accuracy'
            if not flags & ExecutionFlags.USE_ONLY_NEURAL_NETWORK_FLAG == ExecutionFlags.USE_ONLY_NEURAL_NETWORK_FLAG:
                models.append(('LR', linear_model.LogisticRegression(random_state = SEED_HARCODED_VALUE)))
                models.append(('SGDC', linear_model.SGDClassifier(random_state = SEED_HARCODED_VALUE)))
                models.append(('LDA', discriminant_analysis.LinearDiscriminantAnalysis()))
                models.append(('KNN', neighbors.KNeighborsClassifier(n_neighbors = 8)))
                models.append(('CART', tree.DecisionTreeClassifier(max_leaf_nodes = 256, random_state = SEED_HARCODED_VALUE)))
                models.append(('LGBMC', lgb.LGBMClassifier(n_estimators = 1024, num_leaves = 128, max_depth = 32, learning_rate=0.05, random_state = SEED_HARCODED_VALUE)))
                models.append(('XGB', xgb.XGBClassifier(n_estimators = 1024, learning_rate=0.05, random_state = SEED_HARCODED_VALUE)))
                models.append(('BGK', ensemble.GradientBoostingClassifier(n_estimators = 1024, max_depth = 32, random_state = SEED_HARCODED_VALUE)))
                models.append(('RF', ensemble.RandomForestClassifier(n_estimators = 1024, max_depth = 32, max_features = 4, random_state = SEED_HARCODED_VALUE)))
                models.append(('NB', naive_bayes.GaussianNB()))
                models.append(('SVC', svm.SVC(random_state = SEED_HARCODED_VALUE)))
            if flags & ExecutionFlags.USE_NEURAL_NETWORK_FLAG == ExecutionFlags.USE_NEURAL_NETWORK_FLAG or flags & ExecutionFlags.USE_ONLY_NEURAL_NETWORK_FLAG == ExecutionFlags.USE_ONLY_NEURAL_NETWORK_FLAG:
                models.append(('NNC', KerasClassifier(build_fn = kaggle_create_sequential_classifier_model, nb_epoch = DL_EPOCH_NUM, batch_size = DL_BATCH_SIZE * strategy.num_replicas_in_sync, verbose = 1)))
                # TODO Add support of classifiers!

        # Check models
        results, names = kaggle_check_models(models, ml_inputs_training_df, ml_outputs_training_df, p_cross_validation = 's-k-fold', p_scoring = scoring)
        best_alg = kaggle_compare_algorithms_perf(names, results, 'Algorithms Comparison', 'Algorithms', 'Accuracy')
        ml = kaggle_algorithm_tuning(models[best_alg], ml_inputs_training_df, ml_outputs_training_df, (ml_inputs_validation_df, ml_outputs_validation_df), p_strategy = strategy)
    else:
        if OUTPUT_IS_REGRESSION: # Use regression algorithms
            with strategy.scope():
                pass # TODO
            # End of 'with' statement
        else:
            # Using kaggle_check_models() and KerasClassifier requires too much processing time, just use the standard fit() method
            with strategy.scope():
                models.append(('VGG19', kaggle_create_sequential_classifier_model(
                                                                                  #strategy
                                                                                  #p_input_shape = DL_INPUT_SHAPE, 
                                                                                  #p_drop_rate = DL_DROP_RATE,
                                                                                  p_loss = 'sparse_categorical_crossentropy', 
                                                                                  p_metrics = ['sparse_categorical_accuracy'],
                                                                                  p_pretrained_layers = ['VGG19'], 
                                                                                  p_image_size = IMAGE_SIZE, 
                                                                                  p_class_num = len(IMAGE_CLASSES)
                                                                                  )))
                models.append(('ResNet50', kaggle_create_sequential_classifier_model(
                                                                                     #strategy
                                                                                     #p_input_shape = DL_INPUT_SHAPE, 
                                                                                     #p_drop_rate = DL_DROP_RATE,
                                                                                     p_loss = 'sparse_categorical_crossentropy', 
                                                                                     p_metrics = ['sparse_categorical_accuracy'],
                                                                                     p_pretrained_layers = ['ResNet50'], 
                                                                                     p_image_size = IMAGE_SIZE, 
                                                                                     p_class_num = len(IMAGE_CLASSES)
                                                                                     )))
        with strategy.scope():
            print('----------------------------- kaggle_check_dl_models -----------------------------')
            results = []
            histories = []
            trained_models = []
            for name, model in models:
                print('kaggle_check_dl_models: Processing %s with type %s' % (name, type(model)))
                result, history, trained_model = kaggle_check_dl_model(strategy, model, train_df, train_df_size, validation_df, validation_df_size)
                results.append(result)
                histories.append(history)
                trained_models.append(trained_model)
                print('kaggle_check_dl_models: result: ', result)
                #print('kaggle_check_dl_models: history: ', history)
                break # Debug only: stop after the first model; remove this line to train every model in 'models'
                # End of 'for' statement
            ml = trained_models[0]
            # TODO: Add support for regressors
        # End of 'with' statement
    # End of building models and training operation

    # Evaluate the model with the validation dataset
    if (flags & ExecutionFlags.USE_CNN_NEURAL_NETWORK_FLAG) != ExecutionFlags.USE_CNN_NEURAL_NETWORK_FLAG:
        y_predictions = kaggle_validation_prediction(ml, ml_inputs_validation_df, ml_outputs_validation_df)
        kaggle_explore_ml(ml, ml_inputs_validation_df, y_predictions)
    else:
        dataset = copy_validation_df.unbatch().batch(10)
        batch = iter(dataset)
        images, labels = next(batch)
        probabilities = kaggle_prediction(ml, images)
        y_predictions = np.argmax(probabilities, axis=-1)
        kaggle_display_batch((images, labels), y_predictions)

    # Test the model with unseen data (tabular test set, or test images for the CNN case)
    if (flags & ExecutionFlags.USE_CNN_NEURAL_NETWORK_FLAG) != ExecutionFlags.USE_CNN_NEURAL_NETWORK_FLAG:
        y_predictions = kaggle_prediction(ml, ml_inputs_test_df)
        kaggle_explore_ml(ml, ml_inputs_test_df, y_predictions)
    else:
        images_df = test_df.map(lambda image, idnum: image)
        probabilities = kaggle_prediction(ml, images_df)
        y_predictions = np.argmax(probabilities, axis = -1)
        print(y_predictions)

    # Save the result for the Kaggle competition
    print('Main: Save Kaggle competition submission')
    if (flags & ExecutionFlags.USE_CNN_NEURAL_NETWORK_FLAG) != ExecutionFlags.USE_CNN_NEURAL_NETWORK_FLAG:
        if TEST_FILE_NAME is not None:
            kaggle_save_result(ml, 'id', y_predictions, TEST_FILE_NAME, '/kaggle/working/submission.csv')
        else:
            print('Main: TEST_FILE_NAME is None, cannot save Kaggle competition submission')
    else:
        kaggle_save_result_dl(ml, ['id', 'label'], y_predictions, test_df, test_df_size, '/kaggle/working/submission.csv')
        
    # Save the model and try to reload it
    if (flags & ExecutionFlags.USE_CNN_NEURAL_NETWORK_FLAG) != ExecutionFlags.USE_CNN_NEURAL_NETWORK_FLAG:
        print('Main: Save the model')
        file_name = ml.__class__.__name__
        kaggle_save_model(ml, '/kaggle/working/', file_name)
        print('Main: Reload the model')
        ml = kaggle_load_model('/kaggle/working/', file_name)
        y_predictions = kaggle_validation_prediction(ml, ml_inputs_validation_df, ml_outputs_validation_df)
        kaggle_explore_ml(ml, ml_inputs_validation_df, y_predictions)

    kaggle_post_main()

    print('Main: End of processing')
    # End of function kaggle_main
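
`kaggle_main` chooses between the classic and the CNN branches by testing a bit in `flags`. Below is a minimal sketch of that test. It assumes the flags behave like integer bit masks; `DemoFlags` and `has_flag` are hypothetical names used only for illustration, not the notebook's `ExecutionFlags`.

```python
from enum import IntFlag

class DemoFlags(IntFlag):
    # Hypothetical flags for illustration only (not the notebook's ExecutionFlags)
    NONE = 0
    USE_NEURAL_NETWORK_FLAG = 1
    USE_CNN_NEURAL_NETWORK_FLAG = 2

def has_flag(flags, flag):
    """Return True when every bit of 'flag' is set in 'flags'."""
    # '&' binds tighter than '==' in Python, so the parentheses are optional,
    # but they make the intent explicit
    return (flags & flag) == flag

flags = DemoFlags.USE_NEURAL_NETWORK_FLAG | DemoFlags.USE_CNN_NEURAL_NETWORK_FLAG
cnn_enabled = has_flag(flags, DemoFlags.USE_CNN_NEURAL_NETWORK_FLAG)
classic_only = has_flag(DemoFlags.USE_NEURAL_NETWORK_FLAG, DemoFlags.USE_CNN_NEURAL_NETWORK_FLAG)
print(cnn_enabled, classic_only)  # True False
```

Wrapping the test in a small helper like this would avoid repeating the long comparison in every branch.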

Phew! Now we can execute all the steps described above and get some results:

In [None]:
# Entry point
print("Starting at", datetime.now().strftime("%d/%m/%Y %H:%M:%S"))
kaggle_main()
print("Ending at", datetime.now().strftime("%d/%m/%Y %H:%M:%S"))
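
In the CNN case, the `y_predictions` printed by `kaggle_main` are class indices obtained with `np.argmax(probabilities, axis=-1)`. The sketch below shows how to read them. The probabilities and the class names are made up for illustration; they are not the output of a real model.

```python
import numpy as np

# Made-up softmax outputs for 3 images and 4 classes (illustration only)
probabilities = np.array([
    [0.10, 0.70, 0.15, 0.05],
    [0.60, 0.20, 0.10, 0.10],
    [0.05, 0.05, 0.10, 0.80],
])
class_names = ['class_a', 'class_b', 'class_c', 'class_d']  # Hypothetical labels

# For each image (row), keep the index of the most probable class
y_predictions = np.argmax(probabilities, axis=-1)
# Map each class index back to a readable label
predicted_names = [class_names[i] for i in y_predictions]
print(y_predictions)  # [1 0 3]
print(predicted_names)  # ['class_b', 'class_a', 'class_d']
```

In the notebook, the same index lookup could be done against `IMAGE_CLASSES`, assuming it lists the class names in label order.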

***If you liked this notebook, please upvote it.
It gives me motivation to make new notebooks :)***