As a beginner, I made this notebook to present a generic approach to "play" with the concepts of machine learning and neural networks. I have also tried to provide some clean Python code. To sum up, it is a synthesis of my current knowledge (and sorry for my English).

This is a first version, which will be improved competition after competition.

The basic steps of a 'Generic approach of Machine Learning' are:
1. Define the problem, that is, understand the data you have and define the inputs (attributes) and the output (target) of your Machine Learning
2. Summarize the dataset content in a statistical form
3. Prepare the dataset for your Machine Learning processing
4. Evaluate a set of algorithms based on your understanding of the data
5. Improve the results of your Machine Learning by refining the algorithms
6. Present the results of your Machine Learning
7. Deploy or save your Machine Learning
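
The seven steps above can be sketched on the Iris problem with scikit-learn only. This is a minimal sketch, not the code of this notebook: it assumes the copy of the Iris dataset bundled with scikit-learn (`load_iris`), three arbitrary candidate algorithms and a small hypothetical search grid.

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# 1. Define the problem: four measurements (inputs) -> iris species (target)
iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

# 2. Summarize the dataset content in a statistical form
summary = X.describe().T

# 3. Prepare the dataset: keep some unseen data for the final check
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0, stratify=y)

# 4. Evaluate a set of algorithms with 5-fold cross-validation
candidates = {
    'LR': LogisticRegression(max_iter=1000),
    'KNN': KNeighborsClassifier(),
    'CART': DecisionTreeClassifier(random_state=0),
}
scores = {}
for name, model in candidates.items():
    pipeline = Pipeline([('scaler', StandardScaler()), ('model', model)])
    scores[name] = cross_val_score(pipeline, X_train, y_train, cv=5).mean()

# 5. Improve the results by refining one algorithm (hypothetical grid)
search = GridSearchCV(
    Pipeline([('scaler', StandardScaler()), ('model', KNeighborsClassifier())]),
    param_grid={'model__n_neighbors': [3, 5, 7, 9]},
    cv=5)
search.fit(X_train, y_train)

# 6. Present the results
test_accuracy = accuracy_score(y_test, search.predict(X_test))
print(summary)
print(scores)
print('Best parameters:', search.best_params_, '- test accuracy: %.3f' % test_accuracy)

# 7. Save the model (here in memory; pickle.dump() would write it to a file)
blob = pickle.dumps(search.best_estimator_)
restored = pickle.loads(blob)
```

The scaler is placed inside the pipeline so that it is fitted again on each cross-validation fold, which avoids leaking information from the held-out fold.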

**NOTE: Please, feel free to correct and enhance this notebook ;)**

To define the problem, we first have to choose the subject we will work on. Point b.1 provides different datasets we can play with. For each dataset, a comment describes the problem to address. 
We will consider two different problems:
1. One about classification (the basic one is the Iris classification)
2. One about regression (Melbourne housing prices)

This is the part that cannot be generic. The generic behavior proposed here is driven by the set of parameters defined in point b.1.

Note: In point b.1, to switch to another problem, just comment out the current one and uncomment the problem to play with

Switching to Python code, the first step is to load all the required libraries (1.a) and to choose the problem to solve, let's say Iris classification.

In [None]:
from __future__ import division # Floating-point division (1/4=0.25) instead of Euclidean division (1/4=0); this is already the default in Python 3

# 1. Prepare Problem

# a) Load libraries
import os, warnings, argparse, io, operator, requests
from datetime import datetime

import numpy as np # Linear algebra
import matplotlib.pyplot as plt # Data visualization
import seaborn as sns  # Enhanced data visualization
import pandas as pd # Data processing, CSV file I/O (e.g. pd.read_csv)

from pandas_profiling import ProfileReport

import sklearn
from sklearn import model_selection
from sklearn import linear_model # Linear models (linear regression, logistic regression...)
from sklearn import discriminant_analysis
from sklearn import neighbors # Nearest neighbors algorithms (e.g. k-NN)
from sklearn import naive_bayes
from sklearn import tree # Decision tree learning
from sklearn import svm # Support Vector Machines
from sklearn import ensemble # Provides RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor, AdaBoostRegressor

import xgboost as xgb # Gradient Boosted Decision Trees algorithm

import lightgbm as lgb # Light Gradient Boost model

from sklearn.base import is_classifier, is_regressor # To check if the model is for regression or classification (does not work for Keras)

from sklearn.impute import SimpleImputer 

from sklearn.preprocessing import LabelEncoder # Encode the labels of a categorical feature ('object' type) as integers from 0 to N_class - 1
from sklearn.preprocessing import LabelBinarizer # Binary labelling of categorical feature
from sklearn.preprocessing import OneHotEncoder

from sklearn.preprocessing import StandardScaler # Data standardization (zero mean, unit variance)
from sklearn.preprocessing import MinMaxScaler # Data rescaling to a range (default is [0, 1])
from sklearn.preprocessing import MaxAbsScaler # Data scaling by the maximum absolute value (range [-1, 1])

from sklearn.pipeline import Pipeline

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import r2_score
from sklearn.metrics import f1_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

from sklearn.model_selection import GridSearchCV

from sklearn.inspection import permutation_importance

import pickle # Used to save and load models

import eli5
from eli5.sklearn import PermutationImportance

# Neural Network
import tensorflow as tf
import keras
from keras.wrappers.scikit_learn import KerasClassifier, KerasRegressor

First of all, we have to define the problem:
1. Understand the data, see point b.1) below
2. Prepare the basics of your code such as loading the libraries and your data, see points a) and b) below

In point b.1, we have a set of parameters strongly linked to the problem to solve. These parameters are used to configure the execution of the 'Generic approach of Machine Learning':
- DATABASE_NAME: The URI of the dataset
- COLUMNS_LABEL: The column labels of the dataset. Default: None, means that the labels are already present in the loaded dataset
- COLUMNS_TO_DROP: The useless columns to drop after loading the dataset
- FEATURES_SELECTION: The list of the features for the ML inputs. Default: None, means that all columns (except the output columns) are considered as features
- TARGET_COLUMNS: The output columns
- OUTPUT_IS_REGRESSION: Indicates if the ML is about either regression (True) or classification (False)
- DATE_TIME_COLUMNS: The list of the date/time columns stored in a customized format such as a string format
- FEATURES_POST_LOAD_PROCESSING: This dictionary attaches a lambda function to a column. The lambda function is a processing applied to the column just after loading the dataset (point 1.c).
- FEATURES_PROCESSING: This dictionary attaches a lambda function to a column. The lambda function is a processing applied to the column just before starting data engineering (point 3.a).
- FEATURES_CREATION: This dictionary attaches a lambda function used to create a feature. The key shall be the name of the new column
- FEATURES_DELETION: This list contains the features to be removed after the FEATURES_CREATION processing
- NON_TRANSFORMABLE_COLUMNS: Indicates a list of columns which shall not be included in the transformation process (point 3.b)
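
To make these dictionaries more concrete, here is a minimal sketch of how such lambda functions could be applied to a small DataFrame. The lower-case names (`post_load`, `creation`, `deletion`) and the toy data are hypothetical. The row-by-row application of the creation lambdas (`axis=1`) is an assumption, since that part of the processing is not shown at this point of the notebook.

```python
import numpy as np
import pandas as pd

# Toy data imitating three Titanic columns
df = pd.DataFrame({
    'Cabin': ['C85', np.nan, 'E46'],
    'SibSp': [1, 0, 2],
    'Parch': [0, 0, 1],
})

# Column name -> function applied to each value of this column
post_load = {'Cabin': lambda p_value: p_value[0:1] if pd.notna(p_value) else 'U'}
# New column name -> function applied to each row (assumption: row-wise)
creation = {'FamilySize': lambda p_row: p_row['SibSp'] + p_row['Parch'] + 1}
# Columns to remove once the new features exist
deletion = ['SibSp', 'Parch']

for column, func in post_load.items():
    if column in df:
        df[column] = df[column].apply(func)

for new_column, func in creation.items():
    df[new_column] = df.apply(func, axis=1)

df = df.drop(columns=deletion)
print(df)
```

Testing missing values with `pd.notna()` is safer than an identity test against `np.NaN`, because a missing value is not guaranteed to be that exact object.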

In [None]:
# b) Helpers

# b.1) Define global parameters
# Regression

# Jan Tabular Playground Competition
DATABASE_NAME = None
COLUMNS_LABEL = None
COLUMNS_TO_DROP = ['id'] # Id is useless
FEATURES_SELECTION = None
TARGET_COLUMNS = ['target']
OUTPUT_IS_REGRESSION = True
DATE_TIME_COLUMNS = None
FEATURES_POST_LOAD_PROCESSING = None
FEATURES_PRE_PROCESSING = None
FEATURES_CREATION = None
FEATURES_PROCESSING = None
FEATURES_DELETION = None
NON_TRANSFORMABLE_COLUMNS = None

# To predict house price using the famous Melbourne housing dataset
#DATABASE_NAME = 'https://raw.githubusercontent.com/nagoya-foundation/r-functions-performance/master/data/Melbourne_housing_FULL.csv'
#COLUMNS_LABEL = None
#COLUMNS_TO_DROP = ['Address', 'Method', 'Postcode', 'CouncilArea', 'Propertycount', 'Regionname', 'SellerG']
#FEATURES_SELECTION = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 'YearBuilt', 'Lattitude', 'Longtitude']
#TARGET_COLUMNS = ['Price']
#OUTPUT_IS_REGRESSION=True
#DATE_TIME_COLUMNS = ['Date']
#FEATURES_PRE_PROCESSING = None
#FEATURES_CREATION = None
#FEATURES_DELETION = None
#NON_TRANSFORMABLE_COLUMNS = None
# Suburb
# Address
# Rooms
# Type
# Price
# Method
# SellerG
# Date
# Distance
# Postcode
# Bedroom2
# Bathroom
# Car
# Landsize
# BuildingArea
# YearBuilt
# CouncilArea
# Lattitude
# Longtitude
# Regionname
# Propertycount

# Classification
# To categorize an iris flower according to the dimensions of its sepals & petals 
# Famous database; from Fisher, 1936
#DATABASE_NAME = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
#COLUMNS_TO_DROP = None
#FEATURES_SELECTION = None
#TARGET_COLUMNS = ['class']
#OUTPUT_IS_REGRESSION=False
#DATE_TIME_COLUMNS = None
#COLUMNS_LABEL = ['sepal length in cm', 'sepal width in cm', 'petal length in cm', 'petal width in cm', 'class']

# To predict survival on the Titanic
#DATABASE_NAME = 'https://raw.githubusercontent.com/alexisperrier/packt-aml/master/ch4/titanic.csv'
#COLUMNS_LABEL = None
#COLUMNS_TO_DROP = ['PassengerId', 'Name', 'Ticket'] # PassengerId is useless, Name and Ticket will be processed in a future version
#    # We assume that Name and Ticket are not relevant information
#    # This can be confirmed by the correlation matrix
#FEATURES_SELECTION = None
#TARGET_COLUMNS = ['Survived']
#OUTPUT_IS_REGRESSION=False
#DATE_TIME_COLUMNS = None
#FEATURES_POST_LOAD_PROCESSING = {
#    'Cabin': lambda p_value : p_value[0:1] if pd.notna(p_value) else 'U', 
#        # Create a category U for Unknown and just keep the deck identifier
#}
#FEATURES_PROCESSING = {
#    'Embarked': lambda p_value : p_value[0:1] if pd.notna(p_value) else 'S', 
#        # S has the highest cardinality (see kaggle_summurize_data: distribution of categorical features)
#}
#FEATURES_CREATION = {
#    'FamilySize': lambda p_df : p_df['SibSp'] + p_df['Parch'] + 1,
#    'AgeClass': lambda p_df : 'Senior' if p_df['Age'] >= 60 else 'Adult' if p_df['Age'] >= 35 else 'Young Adult' if p_df['Age'] >= 25 else 'Teen' if p_df['Age'] >= 14 else 'Child' if p_df['Age'] >= 4 else 'Baby',
#        # Create class of ages based on common Age distribution
#    'FareClass': lambda p_df : 'Very Expensive' if p_df['Fare'] >= (3*512/4) else 'Expensive' if p_df['Fare'] >= (512/2) else 'Cheap' if p_df['Fare'] >= (512/4) else 'Very Cheap'
#        # Create class of fare based on discussion below
#}
#FEATURES_DELETION = ['SibSp', 'Parch', 'Age', 'Fare'] # SibSp and Parch were replaced by FamilySize, Age by AgeClass and Fare by FareClass
#NON_TRANSFORMABLE_COLUMNS = None
#  PassengerId: Unique passenger id
#  Survived: Survival status ('Yes' or 'No')
#  Pclass: The class the passenger belongs to (1st, 2nd or 3rd class)
#  Name: Name of the passenger
#  Sex: The sex of the passenger ('male' or 'female')
#  Age: The age of the passenger (in years)
#  SibSp: # of siblings / spouses aboard the Titanic
#  Parch: # of parents / children aboard the Titanic
#  Ticket: No description available for this field, perhaps the travel company identifier
#  Fare: Ticket price
#  Cabin: Identifier of the cabin. The first character identifies the deck.
#         This could be interesting for the ML, creating a new feature Deck
#  Embarked: Port of Embarkation

# This dataset describes the medical records for Pima Indians and whether or not each patient will have an onset of diabetes within five years.
# NOTE: Disable flag DATA_CLEANING_FLAG, this dataset is already ready to be used by ML 
#DATABASE_NAME = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'
#COLUMNS_TO_DROP = None
#FEATURES_SELECTION = None
#TARGET_COLUMNS = ['class']
#OUTPUT_IS_REGRESSION=False
#DATE_TIME_COLUMNS = None
#FEATURES_PRE_PROCESSING = None
#COLUMNS_LABEL = ['preg', 'plas', 'pres (mm Hg)', 'skin (mm)', 'test (mu U/ml)', 'mass', 'pedi', 'age (years)', 'class']
#  preg = Number of times pregnant
#  plas = Plasma glucose concentration at 2 hours in an oral glucose tolerance test
#  pres = Diastolic blood pressure
#  skin = Triceps skin fold thickness (mm)
#  test = 2-Hour serum insulin (mu U/ml)
#  mass = Body mass index (weight in kg/(height in m)^2)
#  pedi = Diabetes pedigree function
#  age = Age (years)
#  class = Class variable (1:tested positive for diabetes, 0: tested negative for diabetes)

# This dataset collects information from 100k medical appointments in Brazil and is focused on the question of whether or not patients show up for their appointment. A number of characteristics about the patient are included in each row
# FIXME: The URL below is a copy of the Pima Indians diabetes one; set it to the CSV file of the medical appointments dataset
#DATABASE_NAME = 'https://github.com/jbrownlee/Datasets/blob/master/pima-indians-diabetes.data.csv'
#COLUMNS_TO_DROP = None
#FEATURES_SELECTION = None
#TARGET_COLUMNS = ['No-show']
#OUTPUT_IS_REGRESSION=False
#DATE_TIME_COLUMNS = None
#FEATURES_PRE_PROCESSING = None
#COLUMNS_LABEL = None

Before loading and examining our dataset, we are just going to set a number of defaults such as the settings for the plotting operations, the Deep Learning parameters... (point b.2)

Note: Point b.3 shall be used if you want to reassemble the notebook code and create a standalone Python script

In [None]:
# b.2) Set some defaults
def kaggle_set_mp_default() -> None:
    """
    Some default settings for Pandas display and Matplotlib plots
    """
    warnings.filterwarnings("ignore") # to clean up output cells
    pd.set_option('display.precision', 3) # Full option name: the short form 'precision' is ambiguous in recent Pandas versions
    # Set Matplotlib defaults
    plt.rc('figure', autolayout=True)
    plt.rc('axes', labelweight='bold', labelsize='large', titleweight='bold', titlesize=18, titlepad=10)
    plt.rc('image', cmap='magma')
    # End of function kaggle_set_mp_default

# Basic Deep Learning parameters
DL_BATCH_SIZE = 32
DL_EPOCH_NUM = 128
DL_DROP_RATE = 0.3

# Fix random values for reproducibility
SEED_HARCODED_VALUE = 0

def kaggle_set_seed(p_seed: int = SEED_HARCODED_VALUE) -> None:
    """
    Random reproducibility
    :parameter p_seed: Set the seed value for random functions
    """
    np.random.seed(p_seed)
    # scikit-learn has no global seed: it uses the NumPy global state (seeded above) or the random_state argument passed to its functions
    tf.random.set_seed(p_seed)
    os.environ['PYTHONHASHSEED'] = str(p_seed)
    os.environ['TF_DETERMINISTIC_OPS'] = '1'
    # End of function kaggle_set_seed

def kaggle_modules_versions() -> None:
    """
    Print the different modules version
    """
    print('----------------------------- modules_versions -----------------------------')
    print("Numpy version: " + np.__version__)
    print('seaborn: %s' % sns.__version__)
    print("Pandas version: " + pd.__version__)
    print("Sklearn version: " + sklearn.__version__)
    print("Tensorflow version: " + tf.__version__)
    print('modules_versions: Done')
    # End of function modules_versions
    
def kaggle_tpu_detection():
    """
    TPU detection
    :return: The appropriate distribution strategy
    """
    print('----------------------------- kaggle_tpu_detection -----------------------------')
    try:
        tpu = tf.distribute.cluster_resolver.TPUClusterResolver() 
        print('kaggle_tpu_detection: Running on TPU ', tpu.master())
    except ValueError:
        tpu = None
    if tpu:
        tf.config.experimental_connect_to_cluster(tpu)
        tf.tpu.experimental.initialize_tpu_system(tpu)
        strategy = tf.distribute.experimental.TPUStrategy(tpu)
    else:
        strategy = tf.distribute.get_strategy() 
    print('kaggle_tpu_detection: %s' % str(strategy.num_replicas_in_sync))
    print('kaggle_tpu_detection: %s' % str(type(strategy)))
    print('kaggle_tpu_detection Done')
    return strategy
    # End of function kaggle_tpu_detection

# b.3) Set execution control flags
from enum import IntFlag

class ExecutionFlags(IntFlag):
    """
    This class provides some execution control flags to enable/disable some parts of the whole script execution
    """
    NONE                     = 0b00000000 # All flags disabled
    ALL                      = 0b11111111 # All flags enabled
    DATA_STAT_SUMMURIZE_FLAG = 0b00000001 # Enable statistical analysis
    DATA_VISUALIZATION_FLAG  = 0b00000010 # Enable data visualization
    DATA_CLEANING_FLAG       = 0b00000100 # Enable data cleaning (feature engineering)
    DATA_TRANSFORM_FLAG      = 0b00001000 # Enable data transformation
    USE_NEURAL_NETWORK_FLAG  = 0b00010000 # Enable neural network models
    # End of class ExecutionFlags
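
Such flags are combined and tested with bitwise operators. The following minimal sketch uses a reduced copy of the flags to stay self-contained: `DemoFlags` and `is_enabled()` are hypothetical names, not part of this notebook.

```python
from enum import IntFlag

class DemoFlags(IntFlag):
    """Reduced, hypothetical copy of the execution flags"""
    NONE = 0b000
    SUMMARIZE = 0b001
    VISUALIZE = 0b010
    CLEAN = 0b100

def is_enabled(p_flags: DemoFlags, p_flag: DemoFlags) -> bool:
    """Return True if all the bits of p_flag are set in p_flags"""
    return (p_flags & p_flag) == p_flag

# Enable two steps with a bitwise OR
enabled = DemoFlags.SUMMARIZE | DemoFlags.CLEAN

# Only the enabled steps are executed
steps_run = [flag.name
             for flag in (DemoFlags.SUMMARIZE, DemoFlags.VISUALIZE, DemoFlags.CLEAN)
             if is_enabled(enabled, flag)]
print(steps_run)

# Disable one step with a bitwise AND on the inverted flag
without_clean = enabled & ~DemoFlags.CLEAN
print(is_enabled(without_clean, DemoFlags.CLEAN))
```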

Now, we are ready to load our dataset and examine it to understand the data it contains. The kaggle_load_datasets() function accepts a URI (e.g. file:///... or http://... or https://...).

When loading the dataset, you can specify or overwrite the column labels.

Depending on the data analysis, you can also define some post-load processing using lambda functions (see FEATURES_POST_LOAD_PROCESSING).

The function kaggle_load_datasets() splits the data into three datasets:
- Training dataset, used to train the model (size fixed by p_train_size, default is 90%)
- Test dataset, used to test the model with unseen data (size is (1 - p_train_size), default is 10%)
- The Training dataset is split again into a Training dataset (80%) and a Validation dataset (20%) used to validate the model while fitting it

Note: The Test dataset does not contain the target features (see TARGET_COLUMNS)
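
The two successive splits can be sketched as follows on synthetic data. This is a minimal illustration of the proportions only: the DataFrame, its column names and `target_columns` are hypothetical stand-ins for the real dataset and for TARGET_COLUMNS.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic dataset: two features and one target, 1000 observations
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'f1': rng.normal(size=1000),
    'f2': rng.normal(size=1000),
    'target': rng.normal(size=1000),
})
target_columns = ['target']  # Hypothetical stand-in for TARGET_COLUMNS

x_df = df.drop(columns=target_columns)
y_df = df[target_columns]

# First split: 90% training, 10% test
X_train, X_test, Y_train, Y_test = train_test_split(
    x_df, y_df, train_size=0.9, random_state=0)
# Second split of the training part: 80% training, 20% validation
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X_train, Y_train, train_size=0.8, random_state=0)

train_df = pd.concat([X_train, Y_train], axis=1)
validation_df = pd.concat([X_validation, Y_validation], axis=1)
test_df = X_test.copy()  # The test dataset keeps the features only

print(len(train_df), len(validation_df), len(test_df))
```

With the default sizes, the final proportions are therefore 72% for training, 18% for validation and 10% for test.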

In [None]:
# c) Load dataset
def kaggle_load_datasets(p_url: str, 
                         p_labels: list = None, 
                         p_train_path: str = None, 
                         p_validation_path: str = None,
                         p_train_size: float = 0.9,
                         p_seed:int = SEED_HARCODED_VALUE
                        ) -> tuple:
    """
    This function loads the dataset specified by p_url or, in case of a Kaggle competition, by (p_train_path, p_validation_path).
    It also sets the column labels and applies the post-load processing to the datasets if required
    :parameters p_url: The URI of the dataset (http://, https:// or file://)
    :parameters p_labels: The labels of the columns to be used. Default: None
    :parameters p_train_path: Kaggle specific path for the train dataset
    :parameters p_validation_path: Kaggle specific path for the dataset without targets (returned as the Test dataset)
    :parameters p_train_size: The fraction of the data kept for training when loading from p_url. Default: 0.9
    :parameters p_seed: The seed used for the splits
    :return: Three datasets, in this order: the Training, Validation and Test datasets
    :exception: Raised if the specified link is not correct
    """
    print('----------------------------- kaggle_load_datasets -----------------------------')
    df = None
    train_df = None
    test_df = None
    validation_df = None 
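    if p_train_path is not None and p_validation_path is not None:
        # Kaggle competition specific
        train_df = pd.read_csv(p_train_path)
        test_df = pd.read_csv(p_validation_path)
        # Set labels (the Test dataset has no target columns)
        if p_labels is not None:
            train_df.columns = p_labels
            test_df.columns = [label for label in p_labels if label not in TARGET_COLUMNS]
        # Split train_df into Training and Validation datasets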
        y_train_df = train_df[TARGET_COLUMNS]
        x_train_df = train_df.drop(TARGET_COLUMNS, axis = 1)
        X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(x_train_df, y_train_df, train_size = 0.8, random_state = p_seed)               
        train_df = pd.concat([X_train, Y_train], axis = 1)
        validation_df = pd.concat([X_validation, Y_validation], axis = 1)       
    else:
        # Get the data
        if p_url is not None:
            if p_url.startswith('file://'):
                df = pd.read_csv(p_url[7:])
            elif p_url.startswith('http'):
                ds = requests.get(p_url).content
                df = pd.read_csv(io.StringIO(ds.decode('utf-8')))
        if df is None:
            raise Exception('kaggle_load_datasets: Failed to load data frame', 'url=%s' % (p_url))
        # Set labels
        if not p_labels is None:
            df.columns = p_labels
        # Split them into Training, Test and Validation datasets
        y_df = df[TARGET_COLUMNS]
        x_df = df.drop(TARGET_COLUMNS, axis = 1)
        X_train, X_test, Y_train, Y_test = model_selection.train_test_split(x_df, y_df, train_size = p_train_size, random_state = p_seed)
        train_df = pd.concat([X_train, Y_train], axis = 1)
        test_df = X_test

        y_df = train_df[TARGET_COLUMNS]
        x_df = train_df.drop(TARGET_COLUMNS, axis = 1)
        X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(x_df, y_df, train_size = 0.8, random_state = p_seed)
        train_df = pd.concat([X_train, Y_train], axis = 1)
        validation_df = pd.concat([X_validation, Y_validation], axis = 1)

    #print('----------------------------- kaggle_load_datasets: training dataset')
    #print(train_df.describe().T)
    #print('----------------------------- kaggle_load_datasets: test dataset')
    #print(test_df.describe().T)
    #print('----------------------------- kaggle_load_datasets: validation dataset')
    #print(validation_df.describe().T)
    
    # Apply post processing after loading dataset
    if not FEATURES_POST_LOAD_PROCESSING is None and isinstance(FEATURES_POST_LOAD_PROCESSING, dict):
        for key in FEATURES_POST_LOAD_PROCESSING.keys():
            if key in train_df:
                train_df[key] = train_df[key].apply(FEATURES_POST_LOAD_PROCESSING[key])
            if key in validation_df:
                validation_df[key] = validation_df[key].apply(FEATURES_POST_LOAD_PROCESSING[key])
            if key in test_df:
                test_df[key] = test_df[key].apply(FEATURES_POST_LOAD_PROCESSING[key])
        # End of 'for' statement

    # Drop columns if any
    if not COLUMNS_TO_DROP is None:
        train_df.drop(COLUMNS_TO_DROP, inplace = True, axis = 1)
        validation_df.drop(COLUMNS_TO_DROP, inplace = True, axis = 1)
        test_df.drop(COLUMNS_TO_DROP, inplace = True, axis = 1)

    print('----------------------------- kaggle_load_datasets: training dataset')
    print(train_df.head())
    print('----------------------------- kaggle_load_datasets: validation dataset')
    print(validation_df.head())
    print('----------------------------- kaggle_load_datasets: test dataset')
    print(test_df.head())

    print('kaggle_load_datasets: Done: %s' % (p_url if not p_url is None else p_train_path))
    return train_df, validation_df, test_df
    # End of function kaggle_load_datasets

Examining the dataset means getting a global overview of its data from a statistical point of view, using:
1. Some basic statistical tools such as means, stds, quartiles and correlation (2.a)
2. Some visualization tools such as histograms, density plots (2.b)

Understanding the data is the most important step. The kaggle_summurize_data() function provides a lot of information to help you in this task:
- Dataset info: It provides information about the structure of the data:
1) The number of features (or attributes or columns), and the name (or label) of each. Here, it is important to understand what each feature means, which values it can take, which units it uses... A lot of research work is needed to understand our problem,
2) The type of each feature. The 'object' type indicates a categorical feature, which means we will have to encode it (and often to do some imputations),
3) One or several of these features will be our ML output, and some of them could be removed later because they bring little to solve our problem (e.g. strongly correlated features, feature reduction using PCA...),
4) The number of observations (or samples) in the dataset. This will be useful to split our dataset into training, validation and test datasets.
- Dataset columns labels: It indicates the name (or label) of each attribute
- Means: It provides the mean value of each feature (also provided by the statistical abstract, see below)
- Dataset statistical abstract: It provides, for each feature, basic statistical metrics such as means, stds, quartiles...
- Dataset Head: It displays the first samples of the dataset. It gives you some indication of the values of each observation. Note that it is not sufficient to detect specific values such as NULL or NaN values, zeros, string values, categorical values... 
- Unique possible columns: It provides, for each feature, the list of its unique values. This will help you during the data transformation to rescale and center the feature values (see point 3.c). Very often, a feature with few unique values (e.g. 2 or 3) is also a categorical feature,
- Correlation table: It provides the correlation between all pairs of features and the list of the correlation values > 0.7 or < -0.7. They will be used to reduce the number of features when some features are strongly linked (see the p_correlation_threshold parameter)
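
The extraction of the strongly correlated pairs can be sketched on synthetic data as follows. The DataFrame and its column names are hypothetical. Unlike a plain filter on the threshold, this sketch also keeps each pair only once, which is a small variation chosen here for readability.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    'a': a,
    'b': 2.0 * a + rng.normal(scale=0.1, size=200),  # Almost a linear function of 'a'
    'c': rng.normal(size=200),                       # Independent noise
    'label': ['x', 'y'] * 100,                       # Categorical, ignored by the correlation
})

threshold = 0.7
# Pearson correlation on the numerical columns only
corr_matrix = df.select_dtypes(include='number').corr(method='pearson')

# One row per pair of features
pairs = corr_matrix.unstack().reset_index()
pairs.columns = ['var1', 'var2', 'corr']

# Keep strong correlations; 'var1 < var2' drops the diagonal and the mirrored pairs
strong = pairs[(pairs['corr'].abs() > threshold) & (pairs['var1'] < pairs['var2'])]
print(strong)
```

Here only the pair ('a', 'b') is expected to be reported, so one of these two features would be a candidate for removal.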

Note: Here we use pandas_profiling to generate an analysis report in HTML format. This report is highly valuable because of the information it provides for each column:
1. Specific value indicators such as zeros, NaN...
2. Distinct values
3. Statistical values such as mean, min/max...

In [None]:
# 2. Summarize the dataset content in a statistical form
# a) Descriptive statistics
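def kaggle_summurize_data(p_df: pd.core.frame.DataFrame, p_correlation_threshold: float = 0.7) -> None:
    """
    This function provides a statistical view of the current dataset
    :parameters p_df: The dataset handle
    :parameters p_correlation_threshold: The absolute correlation value above which a pair of features is reported. Default: 0.7
    """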
    print('----------------------------- kaggle_summurize_data -----------------------------')
    # General information
    print('Dataset info:')
    print(p_df.info())
    print('----------------------------- kaggle_summurize_data: Dataset columns labels:')
    print(p_df.columns)
    print('----------------------------- kaggle_summurize_data: Means:')
    print(p_df.mean(numeric_only = True)) # Means of the numerical columns only
    print('----------------------------- kaggle_summurize_data: Dataset statistical abstract:')
    print(p_df.describe().T)
    print('----------------------------- kaggle_summurize_data: Dataset Head:')
    print(p_df.head(20))
    # NaN values
    print('----------------------------- kaggle_summurize_data: NaN values distribution:')
    print(p_df.isnull().sum().sort_values(ascending = False))
    print("----------------------------- kaggle_summurize_data: Number of rows with NaN: ", p_df.isnull().any(axis = 1).sum())
    # Zeros per columns
    print('----------------------------- kaggle_summurize_data: Zeros per columns distribution:')
    for column in p_df.columns:
        if p_df[column].dtype == 'int64' or p_df[column].dtype == 'float64':
            zeros = p_df[column].isin([0]).sum()
            # Number of zeros and the percentage of rows they represent
            print('{}: {} ({:.2f}%)'.format(column, zeros, 100 * zeros / max(len(p_df), 1)))
        else:
            print('%s: Not numerical column' % column)
    # End of 'for' statement
    # Distribution of categorical features
    print('----------------------------- kaggle_summurize_data: Distribution of categorical features:')
    categorical_columns = [col for col in p_df.columns if p_df[col].dtype == 'object']
    for c in categorical_columns:
        print('Distribution  for %s' % c)
        print(p_df[c].describe())
    # End of 'for' statement
    # Distribution of numerical features
    print('----------------------------- kaggle_summurize_data: Distribution of numerical features:')
    numerical_columns = [col for col in p_df.columns if p_df[col].dtype == 'int64' or p_df[col].dtype == 'float64']
    for c in numerical_columns:
        print('Distribution  for %s' % c)
        print(p_df[c].describe())
    # End of 'for' statement
    #  Unique possible columns
    print('----------------------------- kaggle_summurize_data: Unique possible columns:')
    for column in p_df.columns:
        print('{}: {}'.format(column, p_df[column].unique()))
    # End of 'for' statement
    # Build Correlation matrix (numerical columns only)
    print('----------------------------- kaggle_summurize_data: Correlation table:')
    corr_matrix = p_df.select_dtypes(include = 'number').corr(method = 'pearson')
    print(corr_matrix)
    # Extract correlations > p_correlation_threshold and < -p_correlation_threshold
    print('----------------------------- kaggle_summurize_data: Correlations in range > %f and < -%f:' % (p_correlation_threshold, p_correlation_threshold))
    corr = corr_matrix.unstack().reset_index() # Group together pairwise
    corr.columns = ['var1', 'var2', 'corr'] # Rename columns to something readable
    print(corr[ (corr['corr'].abs() > p_correlation_threshold) & (corr['var1'] != corr['var2']) ] )
    # Finally, create Pandas Profiling
    #print('----------------------------- kaggle_summurize_data: Pandas Profiling:')
    #file = ProfileReport(p_df) # Disabled: it takes too much time
    #file.to_file('./eda.html')
    #file.to_notebook_iframe()
    print('kaggle_summurize_data: Done')
    # End of function kaggle_summurize_data

The kaggle_visualization() function provides different plots to explore the data distribution (Gaussian, exponential...) and to detect outlier values. It will help 1) during the data cleaning and 2) later, to choose the ML algorithms (e.g. outliers have little effect on tree-based algorithms).
There are two kinds of data visualization:
- The Univariate Plots, which are related to each feature, and
- The Multivariate Plots, which are related to the interactions between features

The Univariate Plots:
- Histograms: It provides a graphical representation of the distribution of a dataset. For a continuous numerical feature, it shows the underlying frequency distribution (or probability distribution) of the values (see https://towardsdatascience.com/histograms-why-how-431a5cfbfcd5)
- Density: It is the continuous form of the histogram (see above) and it shows an estimate of the continuous distribution of a feature (Gaussian distribution, exponential distribution...)
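
As an illustration of these two univariate plots, here is a minimal sketch which draws a normalized histogram and a density estimate of the same synthetic feature. It uses Matplotlib and SciPy directly instead of the helpers of this notebook; the exponential sample and the number of bins are arbitrary choices.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic feature following an exponential distribution
rng = np.random.default_rng(0)
values = rng.exponential(scale=2.0, size=500)

fig, ax = plt.subplots(figsize=(6, 3))

# Histogram normalized so that the area of the bars sums to 1
counts, edges, _ = ax.hist(values, bins=30, density=True, alpha=0.5, label='histogram')

# Density: Gaussian kernel estimate evaluated on a regular grid
grid = np.linspace(values.min(), values.max(), 200)
density = gaussian_kde(values)(grid)
ax.plot(grid, density, label='density estimate')

ax.set_xlabel('value')
ax.set_ylabel('density')
ax.legend()
```

Because both curves are normalized, they can be compared on the same axis: the long right tail of the exponential distribution is visible on both.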

The Multivariate Plots:
- Correlation: It indicates how two features change together
- scatter_matrix: It shows how much one feature is affected by another, or the relationship between them
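
Both multivariate plots can be sketched with Pandas and Matplotlib only. This is a minimal example on synthetic data, where 'y' is built to depend on 'x'. The column names and the 'coolwarm' colormap are arbitrary choices, not those of this notebook.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(0)
x = rng.normal(size=150)
df = pd.DataFrame({
    'x': x,
    'y': 0.8 * x + rng.normal(scale=0.3, size=150),  # Depends on 'x'
    'z': rng.normal(size=150),                       # Independent noise
})

# Correlation plot: the Pearson correlation matrix drawn as an image
corr = df.corr(method='pearson')
fig, ax = plt.subplots(figsize=(4, 4))
image = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap='coolwarm')
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(image, ax=ax)

# Scatter matrix: one scatter plot per pair of features, histograms on the diagonal
axes = scatter_matrix(df, figsize=(6, 6), diagonal='hist')
```

In the scatter matrix, the ('x', 'y') cell shows a narrow diagonal cloud, while the cells involving 'z' show a round cloud without any trend.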

The functions below are some helpers for data visualization. They provide different kinds of univariate and multivariate plots. Two special functions provide features vs. target plots and training/validation dataset comparison plots

In [None]:
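def create_grid(p_df: pd.core.frame.DataFrame, p_features:list = None, p_nun_plot_per_lane:int = 3) -> tuple:
    """
    Create the grid in preparation of the plots
    :parameters p_df: The dataset handle
    :parameters p_features: The features to plot. Default: None, means all columns
    :parameters p_nun_plot_per_lane: The number of plots per row of the grid
    :return: The figure, its grid specification and a dictionary of the axes ('ax0', 'ax1'...)
    """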
    # Create figure
    if p_features is None:
        features = p_df.columns
    else:
        features = p_features
    sns.set_style('darkgrid')
    l = len(features) // p_nun_plot_per_lane + (1 if len(features) % p_nun_plot_per_lane != 0 else 0)
    #print('==> l=', l)
    fig = plt.figure(figsize = (l * 3, p_nun_plot_per_lane * 3))
    gs = fig.add_gridspec(l, p_nun_plot_per_lane)
    gs.update(wspace = 0.1, hspace = 0.4)
    background_color = '#fbfbfb'
    # Prepare the grid
    fig_desc = dict()
    run_no = 0
    for row in range(0, l):
        for col in range(0, p_nun_plot_per_lane):
            fig_desc['ax' + str(run_no)] = fig.add_subplot(gs[row, col])
            fig_desc['ax' + str(run_no)].set_facecolor(background_color)
            fig_desc['ax' + str(run_no)].tick_params(axis = 'y', left = True)
            fig_desc['ax' + str(run_no)].get_yaxis().set_visible(True)
            for s in ['top', 'right']:
                fig_desc['ax' + str(run_no)].spines[s].set_visible(False)
            run_no += 1
        # End of 'for' statement
    # End of 'for' statement
    #print('==> #plots=', len(fig_desc))
    
    return (fig, gs, fig_desc)
    # End of function create_grid

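def finalize_grid(p_figure_desc: tuple, p_df: pd.core.frame.DataFrame, p_features:list = None, p_title:str = None, p_comment:str = None) -> None:
    """
    Finalize the grid after the plot
    :parameters p_figure_desc: The (figure, grid specification, axes dictionary) tuple returned by create_grid
    """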
    if p_features is None:
        features = p_df.columns
    else:
        features = p_features
    fig, gs, fig_desc = p_figure_desc
    # Add Titles
    fig_desc['ax0'].text(-0.2, 0.4, p_title, fontsize = 20, fontweight = 'bold', fontfamily = 'serif')
    fig_desc['ax0'].text(-0.2, 0.3, p_comment, fontsize = 13, fontweight = 'light', fontfamily = 'serif')
    # Cleanup unused plots
    for t in range(len(features), len(fig_desc)):
        for s in ['top', 'bottom', 'right', 'left']:
            fig_desc['ax' + str(t)].spines[s].set_visible(False)
        fig_desc['ax' + str(t)].tick_params(axis='x', bottom = False)
        fig_desc['ax' + str(t)].get_xaxis().set_visible(False)
        # End of 'for' statement

    plt.show()

    fig = None
    gs = None
    fig_desc = None
    # End of function finalize_grid

def show_counts(p_df: pd.core.frame.DataFrame, p_features:list = None, p_hue:str = None, p_title:str = None, p_comment:str = None) -> None:
    if p_features is None:
        features = p_df.columns
    else:
        features = p_features
    # Create the grid (sized for the selected features only)
    fig, gs, fig_desc = create_grid(p_df[features])
    # Draw plots
    run_no = 0
    for feature in features:
        # Keyword arguments so that p_hue can be resolved as a column of p_df
        sns.countplot(data = p_df, x = feature, hue = p_hue, ax = fig_desc['ax' + str(run_no)], color='#ffd100')
        run_no += 1
    # End of 'for' statement
    # Finalize the figure
    finalize_grid((fig, gs, fig_desc), p_df, features, p_title, p_comment)
    # End of function show_counts

def show_modes(p_df: pd.core.frame.DataFrame, p_features:list = None, p_title:str = None, p_comment:str = None) -> None:
    if p_features is None:
        features = p_df.columns
    else:
        features = p_features
    # Create the grid
    fig, gs, fig_desc = create_grid(p_df[features])
    # Draw plots
    run_no = 0
    for feature in features:
        try:
            sns.distplot(p_df.loc[:,feature], ax = fig_desc['ax' + str(run_no)], hist = False, color='#ffd100')
        except RuntimeError:
            sns.distplot(p_df.loc[:,feature], ax = fig_desc['ax' + str(run_no)], kde = False, hist = False, color='#ffd100')            
        run_no += 1
    # End of 'for' statement
    # Finalize the figure
    finalize_grid((fig, gs, fig_desc), p_df, features, p_title, p_comment)
    # End of function show_modes

def show_distributions(p_df: pd.core.frame.DataFrame, p_features:list = None, p_title:str = None, p_comment:str = None) -> None:
    if p_features is None:
        features = p_df.columns
    else:
        features = p_features
    # Create the grid
    fig, gs, fig_desc = create_grid(p_df[features])
    # Draw plots
    run_no = 0
    for feature in features:
        try:
            sns.distplot(p_df.loc[:,feature], ax = fig_desc['ax' + str(run_no)], color='#ffd100')
        except RuntimeError:
            sns.distplot(p_df.loc[:,feature], ax = fig_desc['ax' + str(run_no)], kde = False, color='#ffd100')    
        run_no += 1
    # End of 'for' statement
    # Finalize the figure
    finalize_grid((fig, gs, fig_desc), p_df, features, p_title, p_comment)
    # End of function show_distributions

def show_trends(p_df: pd.core.frame.DataFrame, p_features:list = None, p_title:str = None, p_comment:str = None) -> None:
    if p_features is None:
        features = p_df.columns
    else:
        features = p_features
    # Create the grid
    fig, gs, fig_desc = create_grid(p_df[features])
    # Draw plots
    run_no = 0
    for feature in features:
        sns.lineplot(data = p_df[feature], ax = fig_desc['ax' + str(run_no)], color='#ffd100')
        run_no += 1
    # End of 'for' statement
    # Finalize the figure
    finalize_grid((fig, gs, fig_desc), p_df, features, p_title, p_comment)
    # End of function show_trends

def show_correlations(p_df: pd.core.frame.DataFrame, p_title:str = None, p_comment:str = None) -> None:
    # Create the grid
    fig = plt.figure(figsize = (3, 3))
    gs = fig.add_gridspec(1, 1)
    background_color = "#fbfbfb"
    # Prepare the grid
    fig_desc = dict()
    fig_desc['ax0'] = fig.add_subplot(gs[0, 0])
    fig_desc['ax0'].set_facecolor(background_color)
    fig_desc['ax0'].tick_params(axis = 'y', left=False)
    fig_desc['ax0'].get_yaxis().set_visible(False)
    for s in ["top", "right", "left"]:
        fig_desc['ax0'].spines[s].set_visible(False)
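    # Draw plots (correlations are computed on numerical columns only and drawn explicitly on 'ax0')
    sns.heatmap(data = p_df.select_dtypes(include = 'number').corr(), annot = True, ax = fig_desc['ax0'])
    # Finalize the figure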
    # Add Titles
    fig_desc['ax0'].text(-0.2, 0.4, p_title, fontsize = 20, fontweight = 'bold', fontfamily = 'serif')
    fig_desc['ax0'].text(-0.2, 0.3, p_comment, fontsize = 13, fontweight = 'light', fontfamily = 'serif')   
    plt.show()
    # End of function show_correlations

def show_outliers(p_df: pd.core.frame.DataFrame, p_title:str = None, p_comment:str = None) -> None:
    features = p_df.columns
    # Create the grid
    fig, gs, fig_desc = create_grid(p_df[features])
    # Draw plots
    run_no = 0
    for feature in features:
        # Box plot of the number of occurrences of each value of the feature
        ds = p_df[feature].value_counts()
        sns.boxplot(x = ds, ax = fig_desc['ax' + str(run_no)], color='#ffd100')
        run_no += 1
    # End of 'for' statement
    # Finalize the figure
    finalize_grid((fig, gs, fig_desc), p_df, features, p_title, p_comment)
    # End of function show_outliers

def show_features_vs_target(p_df: pd.core.frame.DataFrame, p_target:str, p_features:list = None, p_title:str = None, p_comment:str = None) -> None:
    if p_features is None:
        features = p_df.columns.tolist() # Using tolist() for removing p_target
    else:
        features = list(p_features) # Work on a copy so that the caller's list is left untouched
    if p_target in features:
        features.remove(p_target)
    # Create the grid
    fig, gs, fig_desc = create_grid(p_df[features])
    # Draw plots
    run_no = 0
    for feature in features:
        # relplot() is a figure-level function which cannot draw on 'ax': use the axes-level scatterplot() instead
        sns.scatterplot(x = p_target, y = feature, data = p_df, ax = fig_desc['ax' + str(run_no)], color='#ffd100')
        run_no += 1
    # End of 'for' statement
    # Finalize the figure
    finalize_grid((fig, gs, fig_desc), p_df, features, p_title, p_comment)
    # End of function show_features_vs_target

In [None]:
# b) Data visualizations
def kaggle_visualization(p_df: pd.core.frame.DataFrame) -> None:
    """
    This method provides different views of the dataset (plot)
    :parameters p_df: The dataset handle
    """
    print('----------------------------- kaggle_visualization_data -----------------------------')
    features = list(set(p_df.columns) - set(TARGET_COLUMNS))
    categorical_columns = [col for col in p_df.columns if p_df[col].dtype == 'object']
    if not DATE_TIME_COLUMNS is None:
        categorical_columns = list(set(categorical_columns) - set(DATE_TIME_COLUMNS))
    numerical_columns = [col for col in p_df.columns if p_df[col].dtype == 'int64' or p_df[col].dtype == 'float64']
    print('kaggle_visualization: Features Distribution plots')
    show_counts(p_df, p_hue = None)
    #raise Exception('Stop') # Debug only: uncomment to stop after the first plots
    # Histogram plots
    print('kaggle_visualization: Numerical features Distribution plots')
    show_distributions(p_df, p_features = numerical_columns)
    print('kaggle_visualization: Features outliers plots')
    show_outliers(p_df, p_title = 'Features Distribution')
    #show_trends(p_df, p_title = 'Features Distribution', p_comment = 'All features have bimodal or multimodal distribution')
    print('kaggle_visualization: Features correlation heatmap')
    show_correlations(p_df)
    print('kaggle_visualization: Features VS target distribution plots')
    # show_features_vs_target() expects a single target column name, not the list of targets
    show_features_vs_target(p_df, p_target = TARGET_COLUMNS[0])#, p_features = numerical_columns)
    print('kaggle_visualization: Done')
    # End of function kaggle_visualization

def cross_dataset_visualization(p_dfs: list) -> None:

    print('cross_dataset_visualization: Done')
    # End of function cross_dataset_visualization

TODO: Add data analysis results

The function kaggle_ml_quick_and_dirty() provides a 'quick and dirty' evaluation of a ML based on the RandomForest algorithm (RandomForestRegressor for a regression problem, RandomForestClassifier otherwise) with the n_estimators parameter set to 128. All rows with NaN values are removed and all categorical attributes are excluded.
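
To illustrate the idea on something small, here is a minimal sketch of the same shortcuts (drop the rows with NaN, ignore the categorical columns, fit a random forest). The toy dataset, its column names and the 150/42 split are invented for the example and are not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Hypothetical toy dataset: two numerical features, one categorical feature, one binary target
rng = np.random.default_rng(42)
toy_df = pd.DataFrame({
    'x1': rng.normal(size = 200),
    'x2': rng.normal(size = 200),
    'color': rng.choice(['red', 'green'], size = 200),
})
toy_df['target'] = (toy_df['x1'] + toy_df['x2'] > 0).astype(int)
toy_df.loc[::25, 'x2'] = np.nan # Inject some missing values

# Quick & dirty cleaning: remove rows with NaN, keep numerical columns only
clean_df = toy_df.dropna(axis = 0).select_dtypes(include = ['number'])

# Simple split: first rows for training, remaining rows for validation
train_df = clean_df.iloc[:150]
validation_df = clean_df.iloc[150:]

# Fit the model and score it on the validation rows
model = RandomForestClassifier(n_estimators = 128, random_state = 42)
model.fit(train_df.drop('target', axis = 1), train_df['target'])
predictions = model.predict(validation_df.drop('target', axis = 1))
accuracy = accuracy_score(validation_df['target'], predictions)
print('Quick & dirty accuracy: %0.4f' % accuracy)
```

The score obtained this way is only a baseline to compare against once the data has been properly prepared.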

In [None]:
# c) Basic ML for a quick & dirty evaluation
def kaggle_ml_quick_and_dirty(p_train_df: pd.core.frame.DataFrame, 
                              p_validation_df: pd.core.frame.DataFrame, 
                              p_test_df: pd.core.frame.DataFrame = None,
                              p_seed:int = SEED_HARCODED_VALUE
                             ) -> object:
    """
    This method provides a first ML evaluation based on the RandomForest algorithm
    :parameters p_train_df: The training dataset (to fit the model)
    :parameters p_validation_df: The validation dataset (to validate the model)
    :parameters p_test_df: The test dataset (to do predictions on unseen data). Default: None
    :parameter p_seed: The seed value
    :return: The machine learning model
    """
    print('----------------------------- kaggle_ml_quick_and_dirty -----------------------------')
    # Build training & validation datasets
    p = p_train_df.copy()
    # Remove NaN values
    p.dropna(axis = 0, inplace = True)
    # Ignore categorical values
    p = p.select_dtypes(exclude=['object'])
    Y_train = p[TARGET_COLUMNS]
    if FEATURES_SELECTION is None:
        X_train = p.drop(TARGET_COLUMNS, axis = 1)
    else:
        X_train = p[FEATURES_SELECTION]

    p = p_validation_df.copy()
    # Remove NaN values
    p.dropna(axis = 0, inplace = True)
    # Ignore categorical values
    p = p.select_dtypes(exclude=['object'])
    Y_validation = p[TARGET_COLUMNS]
    if FEATURES_SELECTION is None:
        X_validation = p.drop(TARGET_COLUMNS, axis = 1)
    else:
        X_validation = p[FEATURES_SELECTION]

    # Use classical model
    model = None
    if OUTPUT_IS_REGRESSION:
        model = ensemble.RandomForestRegressor(n_estimators = 128, max_depth = 16, max_features = 4, random_state = p_seed)
    else:
        model = ensemble.RandomForestClassifier(n_estimators = 128, max_depth = 16, max_features = 4, random_state = p_seed)
    # Train the model
    if len(TARGET_COLUMNS) == 1:
        model.fit(X_train, Y_train[TARGET_COLUMNS[0]].to_numpy().ravel()) # 1-D array of targets
    else:
        model.fit(X_train, Y_train)
    # Do predictions
    y_predictions = model.predict(X_validation)
    # Get scoring
    if OUTPUT_IS_REGRESSION:
        print('kaggle_ml_quick_and_dirty: Model R2 score=%f' % (r2_score(Y_validation, y_predictions)))
        print('kaggle_ml_quick_and_dirty: Model Mean absolute error regression loss (MAE): %0.4f' % mean_absolute_error(Y_validation, y_predictions))
        print('kaggle_ml_quick_and_dirty: Model Mean squared error regression loss (MSE): %0.4f' % mean_squared_error(Y_validation, y_predictions))
        print('kaggle_ml_quick_and_dirty: Model Root mean squared error regression loss (RMSE): %0.4f' % np.sqrt(mean_squared_error(Y_validation, y_predictions)))
    else:
        # ROC AUC on predicted labels and the default (binary) F1 score are only defined for two classes
        is_binary = np.unique(Y_validation).size == 2
        print('kaggle_ml_quick_and_dirty: Model accuracy score: %0.4f' % accuracy_score(Y_validation, y_predictions))
        if is_binary:
            print('kaggle_ml_quick_and_dirty: ROC=%s' %(roc_auc_score(Y_validation, y_predictions)))
        print('kaggle_ml_quick_and_dirty: Model F1 score=%f' % (f1_score(Y_validation, y_predictions, average = 'binary' if is_binary else 'weighted')))
        print('kaggle_ml_quick_and_dirty: Confusion matrix: %s' % str(confusion_matrix(Y_validation, y_predictions)))
        print('kaggle_ml_quick_and_dirty: Classification report:\n%s' % str(classification_report(Y_validation, y_predictions)))
    
    # Do prediction with unseen data
    if not p_test_df is None:
        p = p_test_df.copy()
        # Remove NaN values
        p.dropna(axis = 0, inplace = True)
        # Ignore categorical values
        p = p.select_dtypes(exclude=['object'])
        if FEATURES_SELECTION is None:
            X_validation = p
        else:
            X_validation = p[FEATURES_SELECTION]
        y_predictions = model.predict(X_validation)
        # FIXME Evaluate the results?

    print('kaggle_ml_quick_and_dirty: Done')
    return model
    # End of function kaggle_ml_quick_and_dirty

So, the next step is to prepare the data for ML. Usually, you get better results when all the columns (input features and outputs) are in numerical format (int or float).

1. Feature engineering. It eliminates NULL or NaN values and duplicate values, transforms date/time columns and categorical columns into numerical features, and identifies and handles outliers (3.a). Categorical columns are usually of type object, and the objective here is to transform them into numerical columns. Date/time columns can be either of type object (e.g. date/time in string format) or of type datetime64[ns]. For specific features such as 'Age', it is possible to create a new feature grouping the Age values per range, from the lowest Age value to the highest one
2. Data transformation. It applies some numerical transformation such as standardization of features... (3.b)
3. Features selection. It selects and prepares the dataset for the training and the validation (3.c)
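
The feature engineering ideas listed in point 1 can be sketched on a tiny dataset with pandas only. Everything here is hypothetical (the columns, the Age bounds, the choice of counting days since the first date); the notebook function below uses scikit-learn encoders and converts dates to int64 instead.

```python
import pandas as pd

# Hypothetical toy dataset with a missing category, a date in string format and an 'Age' feature
toy_df = pd.DataFrame({
    'Age': [5, 17, 34, 62, 80],
    'City': ['Paris', None, 'Nice', 'Paris', 'Paris'],
    'Date': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05'],
})

# Replace missing categorical values by the most frequent value
toy_df['City'] = toy_df['City'].fillna(toy_df['City'].value_counts().idxmax())

# Convert the date/time column into a numerical column (number of days since the first date)
dates = pd.to_datetime(toy_df['Date'])
toy_df['Date'] = (dates - dates.min()).dt.days

# Group the Age values per range: (0, 18], (18, 65], (65, 120] (arbitrary bounds)
toy_df['AgeGroup'] = pd.cut(toy_df['Age'], bins = [0, 18, 65, 120], labels = [0, 1, 2]).astype(int)

# One-hot encode the low-cardinality categorical column
toy_df = pd.get_dummies(toy_df, columns = ['City'], dtype = int)
print(toy_df)
```

After these steps every column is numerical, which is the state the following steps (transformation, selection) expect.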

In [None]:
# 3. Prepare the dataset for your Machine Learning processing
# a) Data Cleaning
Encoders = dict()
Encoder_Instance = LabelEncoder() # Use global variable for future reverse features engineering
Imputer_Instance = None
def kaggle_features_engineering(p_df: pd.core.frame.DataFrame,                               
                                p_missing_value_method: str = 'drop_columns', 
                                p_duplicated_value_method: str = 'drop_columns', 
                                p_categorical_onehot_threshold: int = 10, 
                                p_date_time_columns: list = None, 
                                p_date_time_engineering: str = 'python_time') -> tuple:
    """
    This function performs a cleaning of the dataset to remove null values, duplicate values, based on the specified method
    :parameters p_df: The dataset handle
    :parameters p_missing_value_method: The method to cleanup NaN values in the dataset ('drop_columns', 'drop_lines', 'mean', 'median'). Default: 'drop_columns'
    :parameters p_duplicated_value_method: The method to cleanup duplicated in the dataset ('drop_columns', 'drop_lines', 'mean', 'median'). Default: 'drop_columns'
    :parameters p_categorical_onehot_threshold: The maximum cardinality to apply OneHotEncoder to a categorical variable. Default: 10
    :parameters p_date_time_columns: The list of the date/time columns to convert. Default: None
    :parameters p_date_time_engineering: The method to convert Date/Time. Default: 'python_time'
    :return: The dataset after the cleanup process, the list of the new categorical columns and the list of the numerical columns
    """
    global Encoders, Encoder_Instance, Imputer_Instance
    
    print('----------------------------- kaggle_features_engineering -----------------------------')
    # Cleanup dataset
    old_shape = p_df.shape
    p = p_df.copy() # The final dataset

    # Apply feature processing resulting of the data analyzing
    if not FEATURES_PROCESSING is None and isinstance(FEATURES_PROCESSING, dict):
        for key in FEATURES_PROCESSING.keys():
            if key in p.columns:
                p[key] = p[key].apply(FEATURES_PROCESSING[key])
        # End of 'for' statement

    # Build the list of categorical and numerical features
    categorical_columns = [col for col in p.columns if p[col].dtype == 'object']
    numerical_columns = [col for col in p.columns if p[col].dtype == 'int64' or p[col].dtype == 'float64']
    print('kaggle_features_engineering: Initial categorical_columns:')
    print(categorical_columns)
    print('kaggle_features_engineering: Initial numerical_columns:')
    print(numerical_columns)
    if p_date_time_columns is not None:
        if len(categorical_columns) != 0:
            categorical_columns = list(set(categorical_columns) - set(p_date_time_columns))
        numerical_columns = list(set(numerical_columns) - set(p_date_time_columns))

    # Convert Date/time columns
    # dtype = 'datetime64[ns]'
    print('----------------------------- kaggle_features_engineering: Processing Date/Time columns')
    if p_date_time_columns is not None: # Process specified columns
        # Check date/time formats
        for column in p_date_time_columns: # TODO Check if all DateTime values have the same format, i.e. same length
            date_lengths = p[column].str.len().value_counts()
            print('kaggle_features_engineering: %s lengths:' % column)
            print('%s - %d' % (str(date_lengths), len(date_lengths)))
        # End of 'for' statement
        p[p_date_time_columns] = p[p_date_time_columns].astype('datetime64[ns]')
        p[p_date_time_columns] = p[p_date_time_columns].astype('int64')
        print('kaggle_features_engineering: Date/time columns processed')
    else:
        print('kaggle_features_engineering: No date/time values')
    # Be sure there is no more 'datetime64[ns]' types in the dataset
    datetime_columns = [col for col in p.columns if p[col].dtype == 'datetime64[ns]']
    if len(datetime_columns) != 0:
        raise Exception('kaggle_features_engineering: There are still datetime64[ns] columns in the dataset', 'columns=%s' % str(datetime_columns))

    # Find N/A values for categorical columns and replace them by the value with the highest frequency
    print('----------------------------- kaggle_features_engineering: Processing NaN values')
    categorical_columns_with_nan = [col for col in p.columns if p[col].dtype == 'object' and p[col].isna().sum() != 0]
    if len(categorical_columns_with_nan) != 0:
        print('----------------------------- kaggle_features_engineering: Impute NaN values for categorical columns with the most frequent value')
        for col in categorical_columns_with_nan:
            # Assign the result back: inplace = True on a column selection may act on a temporary copy
            p[col] = p[col].fillna(p[col].value_counts().idxmax())
        # End of 'for' statement
        # Check that there are no more categorical columns with NaN
        categorical_columns_with_nan = [col for col in p.columns if p[col].dtype == 'object' and p[col].isna().sum() != 0]
        if len(categorical_columns_with_nan) != 0:
            raise Exception('kaggle_features_engineering: There are still categorical columns with NaN values in the dataset', 'columns=%s' % str(categorical_columns_with_nan))
    else:
        print('----------------------------- kaggle_features_engineering: No NaN value in categorical columns')
    # Use Imputation to replace NaN in numerical columns
    print('----------------------------- kaggle_features_engineering: Impute NaN values for numerical columns with %s method' % p_missing_value_method)
    numerical_columns_with_nan = [col for col in p.columns if (p[col].dtype == 'int64' or p[col].dtype == 'float64') and p[col].isna().sum() != 0]
    if len(numerical_columns_with_nan) != 0:
        print('kaggle_features_engineering: cols_with_missing: %s' % (str(numerical_columns_with_nan)))
        # Find rows with missing values
        rows_with_null = p[numerical_columns_with_nan].isnull().any(axis=1)
        rows_with_missing = p[rows_with_null]
        print('kaggle_features_engineering: rows_with_missing: %s/%s' % (rows_with_missing.shape[0], p.shape[0]))
        # Impute missing values
        if p_missing_value_method == 'drop_columns': # Impute removing columns
            p = p.drop(numerical_columns_with_nan, axis = 1)
        elif p_missing_value_method == 'drop_lines': # Impute removing rows
            p = p.dropna()
        else: # Impute using SimpleImputer
            if Imputer_Instance is None:
                if p_missing_value_method == 'mean':
                    Imputer_Instance = SimpleImputer(strategy='mean')
                elif p_missing_value_method == 'median':
                    Imputer_Instance = SimpleImputer(strategy='median')
                else:
                    raise Exception('kaggle_features_engineering: Invalid method', 'method=%s' % (p_missing_value_method))
            # else, nothing to do
            labels = p.columns # Save labels
            Imputer_Instance = Imputer_Instance.fit(p[numerical_columns_with_nan])
            p[numerical_columns_with_nan] = Imputer_Instance.transform(p[numerical_columns_with_nan])
            #p[numerical_columns_with_nan] = pd.DataFrame(Imputer_Instance.fit_transform(p))
            # Restore column names
            p.columns = labels
            print('kaggle_features_engineering: Cleaning NaN values: old_shape: %s / new shape: %s' % (str(old_shape), str(p.shape)))
            print(p.head())
            numerical_columns_with_nan = [col for col in p.columns if (p[col].dtype == 'int64' or p[col].dtype == 'float64') and p[col].isna().sum() != 0]
            if len(numerical_columns_with_nan) != 0:
                raise Exception('kaggle_features_engineering: There are still numerical columns with NaN values in the dataset', 'columns=%s' % str(numerical_columns_with_nan))
    else:
        print('kaggle_features_engineering: No missing values in numerical columns')
    print('----------------------------- kaggle_features_engineering: After First round:')
    #print(p.head())
    print(p.describe().T)

    print('----------------------------- kaggle_features_engineering: Features creation/deletion:')
    if not FEATURES_CREATION is None or not FEATURES_DELETION is None:
        # Features creation
        if not FEATURES_CREATION is None:
            for key in FEATURES_CREATION.keys():
                p[key] = p.apply(FEATURES_CREATION[key], axis = 1)
                # End of 'for' statement
        if not FEATURES_DELETION is None:
            p.drop(FEATURES_DELETION, inplace = True, axis = 1)
        categorical_columns = [col for col in p.columns if p[col].dtype == 'object']
        numerical_columns = [col for col in p.columns if p[col].dtype == 'int64' or p[col].dtype == 'float64']
        print('kaggle_features_engineering: Rebuild categorical_columns:')
        print(categorical_columns)
        print('kaggle_features_engineering: Rebuild numerical_columns:')
        print(numerical_columns)
    print('----------------------------- kaggle_features_engineering: After Second round:')
    #print(p.head())
    print(p.describe().T)

    # Search for categorical variables
    print('----------------------------- kaggle_features_engineering: Encoding categorical columns:')
    new_categorical_columns = []
    if len(categorical_columns) != 0:
        print('kaggle_features_engineering: categorical_columns: ' + str(categorical_columns))
        # Compute cardinalities of the categorical variables
        categorical_columns_cardinalities = list(map(lambda col: p[col].nunique(), categorical_columns))
        print('kaggle_features_engineering: categorical_columns_cardinalities: ')
        print(categorical_columns_cardinalities)
        print('kaggle_features_engineering: OneHotEncoder thresholds: %d' % p_categorical_onehot_threshold)
        # Apply OneHot encoding to categorical value with very low cardinality
        cols_processed = []
        new_categorical_columns = categorical_columns.copy()
        for i in range(0, len(categorical_columns)):
            if categorical_columns_cardinalities[i] <= p_categorical_onehot_threshold:
                print('kaggle_features_engineering: OneHotEncoder: %s' % categorical_columns[i])
                if not categorical_columns[i] in Encoders:
                    # Keep the default sparse output and densify it below: the 'sparse' argument was renamed 'sparse_output' in recent scikit-learn versions
                    Encoders[categorical_columns[i]] = OneHotEncoder(handle_unknown = 'ignore')
                new_col = Encoders[categorical_columns[i]].fit_transform(pd.DataFrame(p[categorical_columns[i]])).toarray()
                new_col = pd.DataFrame(new_col, columns = [(categorical_columns[i] + "_" + str(j)) for j in range(new_col.shape[1])])
                new_col.index = p[categorical_columns[i]].index
                p.drop(categorical_columns[i], inplace = True, axis = 1)
                p = pd.concat((p, new_col), axis = 1)
                cols_processed.append(categorical_columns[i])
                # Update the list of the categorical columns
                new_categorical_columns.remove(categorical_columns[i])
                new_categorical_columns.extend(new_col.columns.tolist())
            else:
                print('!!!!!!!!!!!!!!!!!!!! kaggle_features_engineering: Cannot process %s' % categorical_columns[i])
                # Just drop them for the time being
                # FIXME To be refined using TargetEncoder
                p.drop(categorical_columns[i], axis = 1, inplace = True)
                # Update the list of the categorical columns
                new_categorical_columns.remove(categorical_columns[i])
        # End of 'for' statement
        if len(cols_processed) != 0:
            print('kaggle_features_engineering: Encoders applied on %s' % str(cols_processed))
            print('kaggle_features_engineering: New dataset structure:')
            #print(p.head())
            print(p.describe().T)
            categorical_columns = [col for col in p.columns if p[col].dtype == 'object']
            print('kaggle_features_engineering: Cleaning categorical values: old_shape: %s / new shape: %s' % (str(old_shape), str(p.shape)))
            print('kaggle_features_engineering: new Categorical columns:')
            print(categorical_columns)
            # Compute new cardinalities of the categorical variables
            categorical_columns_cardinalities = list(map(lambda col: p[col].nunique(), categorical_columns))
            print('kaggle_features_engineering: New categorical_columns_cardinalities: ')
            print(categorical_columns_cardinalities)
        # TODO: Drop categorical variables with extreme cardinalities
        # Encode the remaining categorical variables using numerical mapping
        # Rebuild the list first: some columns may have been dropped or one-hot encoded above
        categorical_columns = [col for col in p.columns if p[col].dtype == 'object']
        for col in categorical_columns:
            p[col] = Encoder_Instance.fit_transform(p[col].astype(str))
        # End of 'for' statement
        print('kaggle_features_engineering: Labelling:')
        #print(p.head())
        print(p.describe().T)
    else:
        print('kaggle_features_engineering: No categorical values')
    # Be sure there is no more 'object' types in the dataset
    categorical_columns = [col for col in p.columns if p[col].dtype == 'object']
    if len(categorical_columns) != 0:
        raise Exception('kaggle_features_engineering: There are still object type columns in the dataset', 'columns=%s' % str(categorical_columns))
    print('----------------------------- kaggle_features_engineering: After Third round:')
    #print(p.head())
    print(p.describe().T)

    # Identifying & handling outliers
    print('----------------------------- kaggle_features_engineering: Identifying & handling outliers:')
    for c in [col for col in numerical_columns if col in p.columns]: # Skip the columns dropped by the previous steps
        q25, q75 = np.percentile(p[c], 25), np.percentile(p[c], 75)
        iqr = q75 - q25
        print('kaggle_features_engineering: IQR range %f' % iqr)
        # Outlier cutoff threshold
        cut_off = iqr * 1.5
        if not cut_off == 0:
            lower_bound, upper_bound = q25 - cut_off, q75 + cut_off
            print('kaggle_features_engineering: Outliers thresholds for %s: (%f, %f)' % (c, lower_bound, upper_bound))
            outliers = [x for x in p[c] if x < lower_bound or x > upper_bound]
            mean = p[c].mean()
            if len(outliers) != 0:
                print('kaggle_features_engineering: Outliers for %s' % c)
                print(outliers)
                p[c] = p[c].apply(lambda x: mean if x < lower_bound or x > upper_bound else x)
            else:
                print('No outliers for %s' % c)
        else:
            print('No outliers for %s' % c)
    print('----------------------------- kaggle_features_engineering: After Fourth round:')
    #print(p.head())
    print(p.describe().T)
    
#    raise Exception('Stop', 'Stop')

    # Rebuild the list of categorical and numerical features
    categorical_columns = [col for col in p.columns if p[col].dtype == 'object']
    numerical_columns = [col for col in p.columns if p[col].dtype == 'int64' or p[col].dtype == 'float64']
    if p_date_time_columns is not None:
        if len(categorical_columns) != 0:
            categorical_columns = list(set(categorical_columns) - set(p_date_time_columns))
        numerical_columns = list(set(numerical_columns) - set(p_date_time_columns))
    print('----------------------------- kaggle_features_engineering: Categorical columns after: ', categorical_columns)
    print('----------------------------- kaggle_features_engineering: Numerical columns after: ', numerical_columns)

    print('kaggle_features_engineering: ', list(new_categorical_columns)) 
    print('kaggle_features_engineering: Done') 
    return p, new_categorical_columns, numerical_columns
    # End of function kaggle_features_engineering

TODO: Add data engineering results

There are different kinds of data transformation:
- Standardization: It removes the mean of the feature and scales it to unit variance (see point 2.a)
- Scaling: It rescales the feature values to the range [0, 1]
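
As a small worked example, both transformations can be written directly with NumPy. The values are arbitrary; the population standard deviation (NumPy's default) is used, which is also what scikit-learn's StandardScaler computes.

```python
import numpy as np

values = np.array([2.0, 4.0, 6.0, 8.0])

# Standardization: subtract the mean, then divide by the standard deviation
standardized = (values - values.mean()) / values.std()

# Scaling: map the minimum to 0 and the maximum to 1
rescaled = (values - values.min()) / (values.max() - values.min())

print('standardized:', standardized)
print('rescaled:', rescaled)
```

After standardization the feature has mean 0 and variance 1; after scaling, the smallest value is 0 and the largest is 1.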

In [None]:
# b) Data Transforms
Transform = dict()
def kaggle_data_transform(p_df: pd.core.frame.DataFrame, p_columns:list = None, p_transform: str = 'standard') -> pd.core.frame.DataFrame:
    """
    Apply data transformation to the provided dataset
    :parameters p_df: The dataset handle
    :parameters p_columns: The columns to apply transformation (e.g. we don't apply transformation on categorical column)
    :parameter p_transform: The type of transformation. Default: 'standard'
                            'standard': Remove the mean and scale to unit variance
                            'scale': Scale feature to a Min/max range
                            'abs_scale': Scale feature to a range [-1, 1]
    :return: The dataset after transformation
    """
    print('----------------------------- kaggle_data_transform -----------------------------')
    global Transform
    
    p = None
    size = str(p_df.shape[1])
    if not size in Transform:
        if p_transform == 'standard':
            # Standardization, or mean removal and variance scaling
            Transform[size] = StandardScaler()
        elif p_transform == 'scale':
            # Scaling features to a range
            Transform[size] = MinMaxScaler()
        elif p_transform == 'abs_scale':
            # Scaling features to the range [-1, 1] by dividing by the maximum absolute value
            Transform[size] = MaxAbsScaler()
        else:
            raise Exception('kaggle_data_transform: Wrong parameters', 'p_transform=%s' % p_transform)
        if p_columns is None: # Apply transformation to the whole dataset
            p = Transform[size].fit_transform(p_df)
            p = pd.DataFrame(data = p, columns = p_df.columns, index = p_df.index) # Keep the original row index
        else:
            p = p_df.copy() 
            p[p_columns] = Transform[size].fit_transform(p_df[p_columns])
    else:
        if p_columns is None: # Apply transformation to the whole dataset
            p = Transform[size].transform(p_df)
            p = pd.DataFrame(data = p, columns = p_df.columns, index = p_df.index) # Keep the original row index
        else:
            #print('==> p_df[p_columns].shape=', p_df[p_columns].shape)
            #print('==> p_df[p_columns].columns=', p_df[p_columns].columns)            
            p = Transform[size].transform(p_df[p_columns])
            #print('==> p.shape=', p.shape)
            p = pd.DataFrame(data = p, columns = p_df[p_columns].columns)
    print('kaggle_data_transform: Dataset Head:')
    print(p.head())
    
    print('kaggle_data_transform: Done')
    return p
    # End of function kaggle_data_transform

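The kaggle_features_selection() function selects the input features: when FEATURES_SELECTION is set, only these features are kept; otherwise, the features which are strongly correlated with another feature (Pearson correlation beyond p_correlation_threshold) are dropped.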
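
The correlation filter used by kaggle_features_selection() can be sketched on a toy dataset as follows. The data and the 0.7 threshold are only for illustration, and the sketch compares absolute correlations so that strong negative correlations are caught as well.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset: 'b' is almost 2 * 'a', 'c' is weakly related to both
toy_df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.1, 3.9, 6.2, 8.0, 9.9],
    'c': [5.0, 1.0, 4.0, 2.0, 3.0],
})

# Absolute Pearson correlation between each pair of features
cor_matrix = toy_df.corr(method = 'pearson').abs()

# Keep the upper triangle only, so that each pair of features is considered once
upper_tri = cor_matrix.where(np.triu(np.ones(cor_matrix.shape), k = 1).astype(bool))

# Drop the second feature of each strongly correlated pair
to_drop = [column for column in upper_tri.columns if (upper_tri[column] > 0.7).any()]
reduced_df = toy_df.drop(to_drop, axis = 1)
print('Dropped:', to_drop)
```

Here 'b' is dropped because it carries almost the same information as 'a'.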

In [None]:
# c) Features Selection
def kaggle_features_selection(p_df: pd.core.frame.DataFrame, p_correlation_threshold:float = 0.7) -> tuple:
    """
    Select the features: keep only FEATURES_SELECTION when it is provided, otherwise drop the input features which are strongly correlated with another one
    :parameters p_df: The dataset handle
    :parameters p_correlation_threshold: Correlation threshold to calculate lower bound and upper bound for feature removing
    :return: The dataset after features selection and the list of the dropped features
    """
    if not FEATURES_SELECTION is None:
        print('----------------------------- kaggle_features_selection: Features selection is forced, skip it')
        print('kaggle_features_selection: Done')
        return p_df[FEATURES_SELECTION], []

    # Build Correlation matrix
    print('----------------------------- kaggle_features_selection: Correlation table:')
    p = p_df.copy()
    p_corr = p.drop(TARGET_COLUMNS, axis = 1)
    # Extract correlation > 0.7 and < -0.7
    cor_matrix = p_corr.corr(method = 'pearson')
    print('----------------------------- kaggle_features_selection: cor_matrix:')
    print(cor_matrix)
    # np.bool was removed from recent NumPy versions, the builtin bool is equivalent
    upper_tri = cor_matrix.where(np.triu(np.ones(cor_matrix.shape), k = 1).astype(bool))
    print('----------------------------- kaggle_features_selection: Correlations in range > %f and < -%f:' % (p_correlation_threshold, p_correlation_threshold))
    # Compare absolute values so that strong negative correlations are detected too
    to_drop = [column for column in upper_tri.columns if any(upper_tri[column].abs() > p_correlation_threshold)]
    print('kaggle_features_selection: Drop ', to_drop)
    if len(to_drop) != 0:
        # Drop correlated columns
        p.drop(to_drop, axis = 1, inplace = True)
        print('----------------------------- kaggle_features_selection: new dataset:')
        print(p.head())
        print(p.describe().T)

    print('kaggle_features_selection: Done')
    return p, to_drop 
    # End of function kaggle_features_selection

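The function builds the Pearson correlation matrix of the input features (the target columns are excluded), keeps only its upper triangle so that each pair of features is examined once, and drops every column that is correlated with an earlier column beyond the threshold (0.7 by default). The list of dropped columns is returned so that the same columns can be removed from the validation and test datasets.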
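
To illustrate the idea, here is a minimal sketch of correlation-based feature removal on a toy dataset. The names demo_df and demo_correlated_columns are hypothetical and exist only for this example. The sketch compares the absolute value of the correlation with the threshold, which is an assumption about the intended behavior.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset: 'b' is an exact multiple of 'a', 'c' is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo_df = pd.DataFrame({'a': a, 'b': 2.0 * a, 'c': rng.normal(size=200)})

def demo_correlated_columns(p_df: pd.DataFrame, p_threshold: float = 0.7) -> list:
    """Return the columns whose absolute correlation with an earlier column exceeds p_threshold"""
    cor_matrix = p_df.corr(method='pearson').abs()
    # Keep only the strict upper triangle so that each pair of columns is examined once
    mask = np.triu(np.ones(cor_matrix.shape, dtype=bool), k=1)
    upper_tri = cor_matrix.where(mask)
    return [column for column in upper_tri.columns if (upper_tri[column] > p_threshold).any()]

to_drop = demo_correlated_columns(demo_df)
print(to_drop)
print(demo_df.drop(columns=to_drop).columns.tolist())
```

Only the second column of a correlated pair is dropped, so one representative of each group of redundant features stays in the dataset.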

After cleaning and transforming the initial dataset, we can use it to train and validate our ML. So, the next step is to split our dataset into three different 'sub-datasets' (point 4.a):
1. The training dataset, used to train and evaluate the ML models
2. The validation dataset, used to validate the selected model
3. The test dataset, used to test the model

The kaggle_split_dataset() function then separates a dataset into its input features and its target features.
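
As an illustration of such a split, here is a minimal sketch based on the scikit-learn train_test_split() function, applied twice. The toy dataset demo_df and the 60/20/20 proportions are assumptions made for this example, they are not taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy dataset with two inputs and one target
demo_df = pd.DataFrame({'x1': np.arange(100), 'x2': np.arange(100) * 0.5, 'y': np.arange(100) % 2})

# First set aside 20% of the rows for the test dataset
train_valid_df, test_df = train_test_split(demo_df, test_size=0.2, shuffle=True, random_state=42)
# Then take 25% of the remaining rows (20% of the total) for the validation dataset
train_df, valid_df = train_test_split(train_valid_df, test_size=0.25, shuffle=True, random_state=42)

print(len(train_df), len(valid_df), len(test_df))
```

Fixing random_state keeps the three sub-datasets identical from one run to the next, which makes the model comparison reproducible.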

In [None]:
# 4. Evaluate Algorithms
# a) Split-out validation dataset
def kaggle_split_dataset(p_df: pd.core.frame.DataFrame, p_target: list = TARGET_COLUMNS) -> tuple:
    """
    Split the dataset into input features and target features
    :parameter p_df: The dataset handle
    :parameter p_target: The outputs of the Machine Learning
    :return: The input features and the target features (None if the target columns are not in the dataset)
    """
    print('----------------------------- kaggle_split_dataset -----------------------------')
    y_values = None
    x_values = None
    if set(p_target).issubset(set(p_df.columns)):
        y_values = p_df[p_target]
        x_values = p_df.drop(p_target, axis = 1)
    else:
        x_values = p_df
    
    # Re-order column by column name
    x_values = x_values.reindex(sorted(x_values.columns), axis = 1)
    
    print('----------------------------- kaggle_split_dataset: x_values:')
    print(x_values.head())
    if not y_values is None:
        print('----------------------------- kaggle_split_dataset: y_values:')
        print(y_values.head())

    print('kaggle_split_dataset: Done')
    return x_values, y_values
    # End of function kaggle_split_dataset

Now we can apply different models (linear, non-linear, ensemble...) to build our ML and evaluate their efficiency (4.b)
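
Before the full function, here is a minimal sketch of the idea: each candidate model is scored with the same K-fold cross-validation so that the results are comparable. The Iris dataset bundled with scikit-learn, the two models and the names demo_models and demo_scores are choices made for this example only.

```python
from sklearn import datasets, linear_model, model_selection, tree

X, y = datasets.load_iris(return_X_y=True)

# Hypothetical short list of (name, model) couples
demo_models = [
    ('LR', linear_model.LogisticRegression(max_iter=1000)),
    ('CART', tree.DecisionTreeClassifier(random_state=0)),
]

# The same folds are used for every model
kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=0)
demo_scores = {}
for name, model in demo_models:
    cv_results = model_selection.cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
    demo_scores[name] = cv_results
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))
```

The mean gives the expected score of a model and the standard deviation shows how much this score depends on the fold.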

In [None]:
# b) Check models
def kaggle_check_models(p_models: list, p_inputs_training_df: pd.core.frame.DataFrame, p_outputs_training_df: pd.core.frame.DataFrame, p_kparts: int = 10, p_random_state: int = SEED_HARCODED_VALUE, p_cross_validation: str = 'k-fold', p_scoring: str = 'accuracy') -> tuple:
    """
    Apply different models to train our Machine Learning and evaluate their efficiency
    :parameter p_models: A list of (name, model) couples to evaluate
    :parameter p_inputs_training_df: The training inputs dataset (training attributes)
    :parameter p_outputs_training_df: The training output dataset (training target)
    :parameter p_kparts: The number of folds used for the cross-validation
    :parameter p_random_state: The seed used to shuffle the data before splitting it into folds
    :parameter p_cross_validation: 'k-fold' (KFold) or 's-k-fold' (StratifiedKFold). Default: 'k-fold'
    :parameter p_scoring: The scikit-learn scoring metric name (e.g. 'accuracy' or 'r2')
    :return: The list of cross-validation results and the list of model names
    """
    print('----------------------------- kaggle_check_models -----------------------------')
    results = []
    names = []
    for name, model in p_models:
        print('kaggle_check_models: Processing %s with type %s' % (name, type(model)))
        # Create train/test indices to split data in train/test sets
        if p_cross_validation == 'k-fold':
            kfold = model_selection.KFold(n_splits = p_kparts, random_state = p_random_state, shuffle = True) # K-fold Cross Validation
        elif p_cross_validation == 's-k-fold':
            kfold = model_selection.StratifiedKFold(n_splits = p_kparts, random_state = p_random_state, shuffle = True) # Stratified K-fold Cross Validation
        elif p_cross_validation == 'loo':
            kfold = model_selection.LeaveOneOut() # Leave-One-Out Cross Validation
        else:
            raise Exception('kaggle_check_models: Wrong parameters', 'p_cross_validation:%s' % p_cross_validation)
        # Evaluate model performance
        if p_outputs_training_df.shape[1] == 1:
            p = p_outputs_training_df[TARGET_COLUMNS[0]].ravel()
        else:
            p = p_outputs_training_df
        cv_results = model_selection.cross_val_score(model, p_inputs_training_df, p, cv = kfold, scoring = p_scoring)
        print('kaggle_check_models: cv_result=%s' % str(cv_results))
        results.append(cv_results)
        names.append(name)
        msg = 'kaggle_check_models: %s metric: %s: %f (%f)' % (p_scoring, name, cv_results.mean(), cv_results.std())
        print(msg)
        # End of 'for' loop

    print('kaggle_check_models: Done')
    return results, names
    # End of function kaggle_check_models

Then, we select the best model based on the scoring (4.c)
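
The selection itself is a simple 'arg max' on the mean of the cross-validation results. Here is a minimal sketch with hypothetical scores (demo_names and demo_metrics are invented values, not results produced by this notebook):

```python
import numpy as np

# Hypothetical cross-validation results for three models
demo_names = ['LR', 'KNN', 'CART']
demo_metrics = [
    np.array([0.90, 0.92, 0.94]),
    np.array([0.95, 0.97, 0.96]),
    np.array([0.91, 0.89, 0.93]),
]

means = np.array([m.mean() for m in demo_metrics])
stds = np.array([m.std() for m in demo_metrics])

# np.argmax returns the first index holding the maximum value
best_idx = int(np.argmax(means))
print('%s: %f +/- %f' % (demo_names[best_idx], means[best_idx], 2 * stds[best_idx]))
```

This only works as is for metrics where a higher value is better (accuracy, r2 or the scikit-learn 'neg_*' metrics).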

In [None]:
def kaggle_compare_algorithms_perf(p_names: list, p_metrics: list, p_title: str, p_x_label: str, p_y_label:str) -> int:
    """
    Compare and return the model with the best results
    :parameter p_names: The list of models executed
    :parameter p_metrics: The scoring obtained by each model
    :parameter p_title: Performance plot title
    :parameter p_x_label: Performance plot X-axis label
    :parameter p_y_label: Performance plot Y-axis label
    :return: The index of the model with the highest mean score
    """
    print('----------------------------- kaggle_compare_algorithms_perf -----------------------------')
    # Extract means & std
    means = []
    stds = []
    for i in range (len(p_names)):
        cv_results = p_metrics[i]
        means.append(cv_results.mean())
        stds.append(cv_results.std())
        # End of 'for' statement
    # Display means/standard deviation
    plt.title(p_title)
    plt.xlabel(p_x_label)
    plt.ylabel(p_y_label)
    plt.errorbar(p_names, means, stds, linestyle='None', marker='^')
    #plt.savefig('kaggle_algorithms_comparison.png')
    plt.show()
    # Select the best algorithm
    m = np.array(means)
    maxv = np.amax(m)
    idx = np.where(m == maxv)[0][0]
    print('kaggle_compare_algorithms_perf: Max value: %d:%f +/- %f ==> %s' % (idx, maxv, 2 * stds[idx], p_names[idx]))

    print('kaggle_compare_algorithms_perf: Done')
    return idx
    # End of function kaggle_compare_algorithms_perf

Finally, we will use the GridSearchCV() function to find the best model parameters.
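
As a small illustration of what GridSearchCV() does, the sketch below tunes a KNeighborsClassifier on the Iris dataset bundled with scikit-learn. The grid demo_grid is deliberately tiny and is an assumption made for this example.

```python
from sklearn import datasets, neighbors
from sklearn.model_selection import GridSearchCV

X, y = datasets.load_iris(return_X_y=True)

# Hypothetical, deliberately small grid: 3 x 2 = 6 combinations
demo_grid = {'n_neighbors': [4, 8, 16], 'weights': ['uniform', 'distance']}

# Each combination is evaluated with a 5-fold cross-validation
demo_search = GridSearchCV(
    estimator=neighbors.KNeighborsClassifier(),
    param_grid=demo_grid,
    cv=5,
    scoring='accuracy',
)
demo_search.fit(X, y)
print(demo_search.best_params_, demo_search.best_score_)

# With the default refit=True, best_estimator_ is refitted on the whole dataset
demo_tuned_model = demo_search.best_estimator_
```

The number of fits grows with the product of the grid sizes, so it is worth keeping each list of values short.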

In [None]:
# 5. Improve Accuracy
# a) Algorithm Tuning
def kaggle_algorithm_tuning(p_algorithm: list, p_inputs_training_df: pd.core.frame.DataFrame, p_outputs_training_df: pd.core.frame.DataFrame, p_validation_data: list = None):
    """
    Tune the model hyper parameters
    :parameter p_algorithm: The (name, model) couple of the ML model to tune
    :parameter p_inputs_training_df: The input training data
    :parameter p_outputs_training_df: The target for the training data
    :parameter p_validation_data: The (inputs, outputs) validation data, used only by Keras models
    :return: The tuned model
    """
    print('----------------------------- kaggle_algorithm_tuning -----------------------------')
    model = p_algorithm[1]
    print('----------------------------- kaggle_algorithm_tuning: %s' % model.__class__.__name__)
    print('----------------------------- kaggle_algorithm_tuning: model summary:')
    print(model)
    print(model.get_params())

    # Fit the model
    if model.__class__.__name__ == 'LinearRegression': # No Hyper parameters tuning
        print('kaggle_algorithm_tuning: No Hyper parameters tuning for LinearRegression')
        model.fit(p_inputs_training_df, p_outputs_training_df)
        return model
    elif not model.__class__.__name__.startswith('Keras'):
        # Hyper parameters tuning
        print('----------------------------- kaggle_algorithm_tuning: Hyper parameters tuning')
        if model.__class__.__name__.startswith('SV'):
            param_grid = {
                'C': [0.1, 1, 10, 100], 
                'gamma': [1, 0.1, 0.01, 0.001],
                'kernel': ['rbf', 'poly', 'sigmoid']
            }
        elif model.__class__.__name__.startswith('KNeighbors'):
            param_grid = {
                'n_neighbors': [4, 8, 16, 32], 
                'weights': ['uniform', 'distance'],
                'algorithm': ['ball_tree', 'kd_tree', 'brute']
            }
        elif model.__class__.__name__.startswith('LGBM'):
            param_grid = {
                'num_leaves': [32, 128],
                'reg_alpha': [0.1, 0.5],
                'min_data_in_leaf': [32, 64, 128, 256],
                'lambda_l1': [0, 1, 1.5],
                'lambda_l2': [0, 1],
                'learning_rate': [0.01, 0.05, 0.1, 0.2, 0.4, 0.6],
            }
        else:
            # Global grid parameters
            param_grid = {
                'n_estimators': [256, 512, 1024],
                'max_depth': [16, 24 , 32],
                'max_leaf_nodes': [64, 128, 256],
                'learning_rate': [0.01, 0.05, 0.1, 0.2, 0.4, 0.6],
                #'loss': ['deviance'],
                #'min_samples_split': np.linspace(0.1, 0.5, 3),
                #'min_samples_leaf': np.linspace(0.1, 0.5, 3),
                #'max_features': ['log2', 'sqrt'],
                #'criterion': ['friedman_mse',  'mae'],
                #'subsample': np.linspace(0.5, 1.0, 3),
            }
            # Remove unsupported parameters
            if model.__class__.__name__.startswith('RandomForest'):
                del param_grid['learning_rate']

        tunning = GridSearchCV(
            estimator = model,
            param_grid = param_grid, 
            cv = 5, 
            n_jobs = 5, 
            scoring = 'neg_mean_squared_error' if OUTPUT_IS_REGRESSION else 'accuracy', # The mean squared error is meaningful only for regression
            verbose = 2
        )
        if p_outputs_training_df.shape[1] == 1:
            model = tunning.fit(p_inputs_training_df, p_outputs_training_df[TARGET_COLUMNS[0]].ravel())
        else:
            model = tunning.fit(p_inputs_training_df, p_outputs_training_df)
        print('----------------------------- kaggle_algorithms_tuning: Hyper parameters tuning ended:')
        print('kaggle_algorithm_tuning: Hyper parameters tuning: best_score_=', model.best_score_)
        print('kaggle_algorithm_tuning: Hyper parameters tuning: best_params_=', model.best_params_)
        print('kaggle_algorithm_tuning: Hyper parameters tuning: best_estimator_=', model.best_estimator_)
        model = model.best_estimator_
    else:
        early_stopping = keras.callbacks.EarlyStopping(patience = 5, min_delta = 0.001, restore_best_weights = True)
        history = model.fit(p_inputs_training_df, p_outputs_training_df, validation_data = p_validation_data, epochs = DL_EPOCH_NUM, batch_size = DL_BATCH_SIZE * strategy.num_replicas_in_sync, callbacks = [early_stopping])
        print('----------------------------- kaggle_algorithm_tuning: loss/val_loss plot')
        history_df = pd.DataFrame(history.history)
        history_df.loc[:, ['loss', 'val_loss']].plot(title="loss/val_loss")
        print('kaggle_algorithm_tuning: Minimum Validation Loss: {:0.4f}'.format(history_df['val_loss'].min()))
    
    print('kaggle_algorithm_tuning: Done')
    return model
    # End of function kaggle_algorithm_tuning

Now, we reached the point where we can evaluate our model with Validation and/or Test datasets.
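
To make the metrics more concrete, here is a minimal sketch that computes them on small hand-made arrays (all the values are hypothetical). For the classification part, average='macro' is a choice made for this example: with more than two classes, f1_score() needs an explicit averaging mode.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, mean_absolute_error, mean_squared_error, r2_score

# Regression metrics (continuous target values)
y_expected = np.array([3.0, 5.0, 7.0, 9.0])
y_predicted = np.array([2.0, 5.0, 8.0, 9.0])
mae = mean_absolute_error(y_expected, y_predicted)
mse = mean_squared_error(y_expected, y_predicted)
rmse = np.sqrt(mse)
r2 = r2_score(y_expected, y_predicted)
print('MAE=%0.4f MSE=%0.4f RMSE=%0.4f R2=%0.4f' % (mae, mse, rmse, r2))

# Classification metrics (class target values, three classes here)
c_expected = np.array([0, 1, 2, 2, 1, 0])
c_predicted = np.array([0, 1, 2, 1, 1, 0])
acc = accuracy_score(c_expected, c_predicted)
f1 = f1_score(c_expected, c_predicted, average='macro')
print('accuracy=%0.4f F1(macro)=%0.4f' % (acc, f1))
```

The RMSE is expressed in the same unit as the target, which usually makes it easier to interpret than the MSE.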

In [None]:
# b) Ensembles
# 6. Finalize Model
# a) Predictions on validation dataset
def kaggle_validation_prediction(p_model, p_inputs, p_expected_outputs) -> np.ndarray:
    """
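    Executes prediction (p_inputs) and compares outputs against expected outputs (Validation) using the specified ML model
    :parameter p_model: The trained ML model
    :parameter p_inputs: The validation inputs dataset
    :parameter p_expected_outputs: The expected outputs for the validation inputs
    :return: The predictions computed by the model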
    """
    print('----------------------------- kaggle_validation_prediction -----------------------------')
    print('kaggle_validation_prediction: model=%s - is_class:%s - is_regr:%s' % (p_model.__class__.__name__, str(is_classifier(p_model)), str(is_regressor(p_model))))
    y_predictions = p_model.predict(p_inputs)
    print('kaggle_validation_prediction: Score: ', round(p_model.score(p_inputs, p_expected_outputs) * 100, 2), " %")
    if is_regressor(p_model) or p_model.__class__.__name__ == 'KerasRegressor': # Regression metrics (continuous target values)
        print('kaggle_validation_prediction: Model R2 score=%f' % (r2_score(p_expected_outputs, y_predictions)))
        print('kaggle_validation_prediction: Model Mean absolute error regression loss (MAE): %0.4f' % mean_absolute_error(p_expected_outputs, y_predictions))
        print('kaggle_validation_prediction: Model Mean squared error regression loss (MSE): %0.4f' % mean_squared_error(p_expected_outputs, y_predictions))
        print('kaggle_validation_prediction: Model Root mean squared error regression loss (RMSE): %0.4f' % np.sqrt(mean_squared_error(p_expected_outputs, y_predictions)))
        # Analyze residual errors
        plt.scatter(p_expected_outputs, y_predictions)
        plt.show()
        # TODO Interpreting the Coefficients if possible
    elif is_classifier(p_model) or p_model.__class__.__name__ == 'KerasClassifier': # Classification metrics (class target values)
        print('kaggle_validation_prediction: accuracy=%s' %(accuracy_score(p_expected_outputs, y_predictions)))
        print('kaggle_validation_prediction: Model F1 score=%f' % (f1_score(p_expected_outputs, y_predictions)))
        print('kaggle_validation_prediction: ROC=%s' %(roc_auc_score(p_expected_outputs, y_predictions)))
        print('kaggle_validation_prediction: Confusion_matrix:%s' % str(confusion_matrix(p_expected_outputs, y_predictions)))
        print('kaggle_validation_prediction: Classification report:\n%s' % str(classification_report(p_expected_outputs, y_predictions)))
    else:
        raise Exception('kaggle_validation_prediction: Invalid model')
    print('kaggle_validation_prediction: prediction is %s' % (str(y_predictions)))

    print('kaggle_validation_prediction: Done')
    return y_predictions
    # End of function kaggle_validation_prediction

def kaggle_prediction(p_model, p_inputs) -> np.ndarray:
    """
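    Executes prediction (p_inputs) using the specified ML model
    :parameter p_model: The trained ML model
    :parameter p_inputs: The inputs dataset to predict from
    :return: The predictions computed by the model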
    """
    print('----------------------------- kaggle_prediction -----------------------------')
    y_prediction = p_model.predict(p_inputs)
    print('kaggle_prediction: prediction is %s' %(str(y_prediction)))
    print('kaggle_prediction: Done')
    return y_prediction
# b) Create standalone model on entire training dataset
# TODO

The functions below are helpers to save and reload the model and to save our Machine Learning outcomes in the Kaggle competition format.
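
The principle is a pickle 'round trip': the fitted model is serialized into a file, then restored, and the restored model must give the same predictions. Here is a minimal sketch, using a temporary directory and a decision tree fitted on the Iris dataset (both are assumptions made for this example):

```python
import os
import pickle
import tempfile
from sklearn import datasets, tree

X, y = datasets.load_iris(return_X_y=True)
demo_model = tree.DecisionTreeClassifier(random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, 'DemoModel.pkl')
    # The 'with' statement closes the file even if pickle raises an exception
    with open(demo_path, 'wb') as f:
        pickle.dump(demo_model, f)
    with open(demo_path, 'rb') as f:
        demo_restored = pickle.load(f)

same_predictions = bool((demo_model.predict(X) == demo_restored.predict(X)).all())
print(same_predictions)
```

A pickle file should only be loaded from a trusted source, and preferably with the same library versions as the ones used to save it.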

In [None]:
# c) Save model for later use
def kaggle_save_model(p_model, p_paths: str, p_file_name:str) -> None:
    """
    Save the provided model in binary/pickle format and the Encoders/Imputers
    :parameter p_model: The ML model to save
    :parameter p_paths: The path to save the files, ending with a '/' (e.g. ./)
    :parameter p_file_name: The file name without file extension (e.g. 'MyModel')
    """
    global Encoders, Encoder_Instance, Imputer_Instance, Transform

    print('----------------------------- kaggle_save_model -----------------------------')
    # Serialize the model
    with open(p_paths + p_file_name + '.pkl', 'wb') as f:
        pickle.dump(p_model, f)
    print('kaggle_save_model: Model saved: %s' % (p_file_name + '.pkl'))
    # Save Encoders, Encoder_Instance, Imputer_Instance and Transform
    if Encoders is not None and len(Encoders) != 0:
        with open(p_paths + 'Encoders.pkl', 'wb') as f:
            pickle.dump(Encoders, f)
    if Encoder_Instance is not None:
        with open(p_paths + 'Encoder_Instance.pkl', 'wb') as f:
            pickle.dump(Encoder_Instance, f)
    if Imputer_Instance is not None:
        with open(p_paths + 'Imputer_Instance.pkl', 'wb') as f:
            pickle.dump(Imputer_Instance, f)
    if Transform is not None:
        with open(p_paths + 'Transform.pkl', 'wb') as f:
            pickle.dump(Transform, f)

    print('kaggle_save_model: Done')
    # End of function kaggle_save_model

In [None]:
# Load model
def kaggle_load_model(p_paths: str, p_file_name:str):
    """
    Load a model in binary/pickle format and the Encoders/Imputers
    :parameter p_paths: The path to load the files from, ending with a '/' (e.g. ./)
    :parameter p_file_name: The file name without file extension (e.g. 'MyModel')
    :return: The loaded ML model
    """
    global Encoders, Encoder_Instance, Imputer_Instance, Transform

    print('----------------------------- kaggle_load_model -----------------------------')
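    # For each optional file, fall back to a default value if it is missing or unreadable
    try:
        with open(p_paths + 'Encoders.pkl', 'rb') as f:
            Encoders = pickle.load(f)
    except Exception:
        Encoders = dict()
    try:
        with open(p_paths + 'Encoder_Instance.pkl', 'rb') as f:
            Encoder_Instance = pickle.load(f)
    except Exception:
        Encoder_Instance = LabelEncoder()
    try:
        with open(p_paths + 'Imputer_Instance.pkl', 'rb') as f:
            Imputer_Instance = pickle.load(f)
    except Exception:
        Imputer_Instance = None
    try:
        with open(p_paths + 'Transform.pkl', 'rb') as f:
            Transform = pickle.load(f)
    except Exception:
        Transform = None
    # The model file is mandatory: let the exception propagate if it cannot be loaded
    with open(p_paths + p_file_name + '.pkl', 'rb') as f:
        model = pickle.load(f)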
    
    print('kaggle_load_model: Done')
    return model
    # End of function kaggle_load_model

The function kaggle_explore_ml() provides some insights from our ML.
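
The technique used is the permutation importance: the values of one column are shuffled and the drop of the model score is measured. A column that the model really uses produces a large drop. Here is a minimal sketch on a hypothetical dataset where the target depends only on the 'useful' column (for brevity the importance is computed on the training data, whereas the function below expects the validation data):

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

# Hypothetical dataset: the target depends only on the 'useful' column
rng = np.random.default_rng(0)
demo_x = pd.DataFrame({'useful': rng.normal(size=300), 'noise': rng.normal(size=300)})
demo_y = 3.0 * demo_x['useful']

demo_model = LinearRegression().fit(demo_x, demo_y)
result = permutation_importance(demo_model, demo_x, demo_y, n_repeats=8, random_state=0)

# argsort() sorts in ascending order: the most important feature comes last
sorted_idx = result.importances_mean.argsort()
demo_ranking = demo_x.columns[sorted_idx].tolist()
print(demo_ranking)
```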

In [None]:
def kaggle_explore_ml(p_model, p_x_validation: pd.core.frame.DataFrame, p_y_validation: pd.core.frame.DataFrame, p_random_state:int = SEED_HARCODED_VALUE) -> None:
    """
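    Apply feature importance concept to our ML 
    :parameter p_model: The trained ML model to explore
    :parameter p_x_validation: The validation inputs dataset
    :parameter p_y_validation: The validation outputs dataset
    :parameter p_random_state: The seed used for the permutations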
    """
    print('----------------------------- kaggle_explore_ml -----------------------------')
    result = permutation_importance(p_model, p_x_validation, p_y_validation, n_repeats = 32, random_state = p_random_state)
    sorted_idx = result.importances_mean.argsort()
    print('----------------------------- kaggle_explore_ml: result:')
    print(sorted_idx)

    fig, ax = plt.subplots()
    ax.boxplot(result.importances[sorted_idx].T, vert = False, labels = p_x_validation.columns[sorted_idx])
    ax.set_title("Permutation Importances (Validation set)")
    fig.tight_layout()
    plt.show()
    print('kaggle_explore_ml: Done')
    # End of function kaggle_explore_ml

The function kaggle_save_result() saves the prediction results in Kaggle format for submission to a Kaggle competition.
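
A Kaggle submission is usually a CSV file with one identifier column and one prediction column. Here is a minimal sketch of this layout, written into an in-memory buffer instead of a file. The column names 'id' and 'target' and all the values are hypothetical.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical identifiers and predictions (one row per sample, shape (3, 1))
demo_ids = pd.DataFrame({'id': [10, 11, 12]})
demo_predictions = np.array([[0.2], [1.7], [3.0]])

# squeeze() turns the (3, 1) array into a flat array of 3 values
demo_submission = pd.DataFrame({
    'id': demo_ids['id'].astype(int),
    'target': demo_predictions.squeeze(),
})

buffer = io.StringIO()
# index=False prevents pandas from writing its own row index as an extra column
demo_submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```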

In [None]:
def kaggle_save_result(p_model, p_column:str, p_predictions: np.ndarray, p_validation_df: str, p_file_name:str) -> None:
    """
    Save prediction results in Kaggle format for submission to the competition
    :parameter p_model: The ML model used for the predictions (currently unused)
    :parameter p_column: The index column name
    :parameter p_predictions: The prediction results based on Test dataset
    :parameter p_validation_df: The path of the original CSV dataset, to extract the index for Kaggle submission (see p_column)
    :parameter p_file_name: The output file name, including the file extension (e.g. './MyResults.csv')
    """
    print('----------------------------- kaggle_save_result -----------------------------')
    # Reload the index column (see p_column)
    validation_df = pd.read_csv(p_validation_df)
    p = validation_df[[p_column]].astype(int)
    # FIXME How to proceed with multiple outputs?
    print(p.shape)
    print(p_predictions.shape)
    print(len(p_predictions.squeeze()))
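    # Cast the predictions to integer class labels only for classification problems
    values = p_predictions.squeeze() if OUTPUT_IS_REGRESSION else p_predictions.astype(int).squeeze()
    pred = pd.DataFrame({p_column: list(p.squeeze()), TARGET_COLUMNS[0]: values})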
    pred.to_csv(p_file_name, index = False)
    print('kaggle_save_result: Done')
    # End of function kaggle_save_result

The code below is specific to deep learning (DL). It provides the callbacks used to create the DL models.

In [None]:
# Start of main application
DL_INPUT_SHAPE = None

def kaggle_create_sequential_classifier_model(p_optimizer:str = 'adam', p_loss:str = 'binary_crossentropy', p_metrics:list = ['accuracy']) -> tf.keras.Sequential:
    """
    Build a Neural network model for classification
    """
    print('----------------------------- kaggle_create_sequential_classifier_model -----------------------------')
    model = tf.keras.Sequential([
            tf.keras.layers.BatchNormalization(input_shape = DL_INPUT_SHAPE),
            tf.keras.layers.Dense(64, activation='relu'),
            tf.keras.layers.Dropout(rate = DL_DROP_RATE),
            tf.keras.layers.BatchNormalization(),
            tf.keras.layers.Dense(64, activation='relu'),
            tf.keras.layers.BatchNormalization(),
            tf.keras.layers.Dropout(rate = DL_DROP_RATE),
            tf.keras.layers.Dense(64, activation='relu'),
            tf.keras.layers.BatchNormalization(),
            tf.keras.layers.Dropout(rate = DL_DROP_RATE),
            tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer=p_optimizer, loss = p_loss, metrics = p_metrics)
    return model
    # End of function kaggle_create_sequential_classifier_model

def kaggle_create_sequential_regressor_model(p_optimizer:str = 'adam', p_loss:str = 'mae', p_metrics:list = ['mae']) -> tf.keras.Sequential:
    """
    Build a Neural network model for regression
    """
    print('----------------------------- kaggle_create_sequential_regressor_model -----------------------------')
    model = tf.keras.Sequential([
            tf.keras.layers.BatchNormalization(input_shape = DL_INPUT_SHAPE),
            tf.keras.layers.Dense(64, activation='relu'),
            tf.keras.layers.Dropout(rate = DL_DROP_RATE),
            tf.keras.layers.BatchNormalization(),
            tf.keras.layers.Dense(64, activation='relu'),
            tf.keras.layers.BatchNormalization(),
            tf.keras.layers.Dropout(rate = DL_DROP_RATE),
            tf.keras.layers.Dense(64, activation='relu'),
            tf.keras.layers.BatchNormalization(),
            tf.keras.layers.Dropout(rate = DL_DROP_RATE),
            tf.keras.layers.Dense(1, activation='relu'),
    ])
    model.compile(optimizer=p_optimizer, loss = p_loss, metrics = p_metrics)
    return model
    # End of function kaggle_create_sequential_regressor_model

Finally, here is the entry point function and the main call:
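
Before reading it, note that the execution flags used in kaggle_main() are combined as bit masks: '&' with '~' clears a flag, '|' sets one. Here is a minimal sketch of this mechanism with the standard enum.Flag class. DemoFlags is a hypothetical class written for this example, it is not the ExecutionFlags class of this notebook.

```python
from enum import Flag, auto

class DemoFlags(Flag):
    NONE = 0
    SUMMARIZE = auto()
    VISUALIZE = auto()
    CLEANING = auto()
    ALL = SUMMARIZE | VISUALIZE | CLEANING

# Start from everything and clear one flag
demo_flags = DemoFlags.ALL & ~DemoFlags.VISUALIZE

# Test a flag the same way kaggle_main() does
summarize_on = (demo_flags & DemoFlags.SUMMARIZE) == DemoFlags.SUMMARIZE
visualize_on = (demo_flags & DemoFlags.VISUALIZE) == DemoFlags.VISUALIZE
print(summarize_on, visualize_on)

# Clearing a flag in place uses '&= ~FLAG' ('|= ~FLAG' would set all the other flags instead)
demo_flags &= ~DemoFlags.CLEANING
print(demo_flags == DemoFlags.SUMMARIZE)
```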

In [None]:
def kaggle_main() -> None:
    global DL_INPUT_SHAPE
    
    # Set defaults
    kaggle_set_seed()
    kaggle_set_mp_default()
    
    # Current path
    print(os.path.abspath(os.getcwd()))
    # Kaggle current path and files
    for dirname, _, filenames in os.walk('/kaggle/input'):
        for filename in filenames:
            print(os.path.join(dirname, filename))

    # Modules version
    kaggle_modules_versions()

    # Parse arguments. Used only if this notebook code is used as a standalone Python script
    #flags = ExecutionFlags.NONE
    flags = ExecutionFlags.ALL \
            & ~ExecutionFlags.USE_NEURAL_NETWORK_FLAG \
            & ~ExecutionFlags.DATA_STAT_SUMMURIZE_FLAG \
            & ~ExecutionFlags.DATA_VISUALIZATION_FLAG \
            & ~ExecutionFlags.DATA_TRANSFORM_FLAG
    #parser = argparse.ArgumentParser()
    #parser.add_argument('--summarize', help = 'Process statistical analyze', action='store_true')
    #parser.add_argument('--summarize-only', help = 'Process only statistical analyze', action='store_true')
    #parser.add_argument('--visualize', help = 'Generate different plots based on statistical analyze', action='store_true')
    #parser.add_argument('--no-data-cleaning', help = 'Do not apply Data Cleaning', action='store_true')
    #parser.add_argument('--neural-network', help = 'Use neural network as ML', action='store_true')
    #args = parser.parse_args()
    #if args.summarize or args.summarize_only:
    #    flags |= ExecutionFlags.DATA_STAT_SUMMURIZE_FLAG
    #if args.visualize:
    #    flags |= ExecutionFlags.DATA_VISUALIZATION_FLAG
    #if args.no_data_cleaning:
    #    flags &= ~ExecutionFlags.DATA_CLEANING_FLAG
    #if args.neural_network:
    #    flags |= ExecutionFlags.USE_NEURAL_NETWORK_FLAG
    
    # TODO Uncomment if using Pima Indians diabetes dataset
    #flags &= ~ExecutionFlags.DATA_CLEANING_FLAG
    print("Generic template approach to 'play' with the Machine Learning concepts: flags=%s" % str(flags))

    strategy = None
    if flags & ExecutionFlags.USE_NEURAL_NETWORK_FLAG == ExecutionFlags.USE_NEURAL_NETWORK_FLAG:
        strategy = kaggle_tpu_detection()

    train_df, validation_df, test_df = kaggle_load_datasets(p_url = None, p_train_path = '../input/tabular-playground-series-jan-2021/train.csv', p_validation_path = '../input/tabular-playground-series-jan-2021/test.csv')
    #print('Main: training dataset')
    #print(train_df.head())
    #print('Main: validation dataset')
    #print(validation_df.head())
    #print('Main: test dataset')
    #print(test_df.head())

    # Do a basic ML evaluation as reference for the end
    # Takes too much time - y_basic_predictions = kaggle_ml_quick_and_dirty(train_df, validation_df, test_df)

    numerical_columns = None
    categorical_columns = None
    if flags & ExecutionFlags.DATA_STAT_SUMMURIZE_FLAG == ExecutionFlags.DATA_STAT_SUMMURIZE_FLAG:
        print('Main: Summarize data for training dataset')
        kaggle_summurize_data(train_df)
        print('Main: Summarize data for validation dataset')
        kaggle_summurize_data(validation_df)
        print('Main: Summarize data for test dataset')
        kaggle_summurize_data(test_df)
    #    if args.summarize_only:
    #        return

    if flags & ExecutionFlags.DATA_VISUALIZATION_FLAG == ExecutionFlags.DATA_VISUALIZATION_FLAG:
        print('Main: Visualisation for train dataset')
        kaggle_visualization(train_df)

    categorical_features = None
    numerical_features = None
    if flags & ExecutionFlags.DATA_CLEANING_FLAG == ExecutionFlags.DATA_CLEANING_FLAG:
        train_df, new_categorical_features, numerical_features = kaggle_features_engineering(train_df, p_date_time_columns = DATE_TIME_COLUMNS, p_missing_value_method = 'mean')
        print(train_df.columns)
        validation_df, _, _ = kaggle_features_engineering(validation_df, p_date_time_columns = DATE_TIME_COLUMNS, p_missing_value_method = 'mean')
        print(validation_df.columns)
        if len(train_df.columns) != len(validation_df.columns):
            l = list(set(train_df.columns) - set(validation_df.columns))
            print('Main: Features unaligned for validation_df: ', l)
            validation_df[l] = 0
            print(validation_df.columns)
        test_df, _, _ = kaggle_features_engineering(test_df, p_date_time_columns = DATE_TIME_COLUMNS, p_missing_value_method = 'mean')
        print(test_df.columns)
        if len(train_df.columns) != len(test_df.columns):
            l = list(set(train_df.columns) - set(test_df.columns) - set(TARGET_COLUMNS))
            print('Main: Features unaligned for test_df: ', l)
            test_df[l] = 0
            print(test_df.columns)
        print('Main: training dataset after data engineering')
        print(train_df.head())
        print('Main: validation dataset after data engineering')
        print(validation_df.head())
        print('Main: test dataset after data engineering')
        print(test_df.head())
        # Do a basic ML evaluation as reference for the end
        # Takes too much time - y_basic_predictions = kaggle_ml_quick_and_dirty(train_df, validation_df, test_df)

    if flags & ExecutionFlags.DATA_TRANSFORM_FLAG == ExecutionFlags.DATA_TRANSFORM_FLAG:
        # Extract non  categorical columns based on categorical_column list
        if not numerical_features is None:
            columns_to_transform = numerical_features
            if not NON_TRANSFORMABLE_COLUMNS is None:
                columns_to_transform = list(set(columns_to_transform) - set(NON_TRANSFORMABLE_COLUMNS))
            if not OUTPUT_IS_REGRESSION:
                # Remove TARGET_COLUMNS from columns_to_transform list
                columns_to_transform = list(set(columns_to_transform) - set(TARGET_COLUMNS))
            print('Main: columns_to_transform: %s' % str(columns_to_transform))
            print('Main: columns_to_transform: ')
            print(train_df.describe())
            train_df = kaggle_data_transform(train_df, columns_to_transform, p_transform = 'scale')
            validation_df = kaggle_data_transform(validation_df, columns_to_transform, p_transform = 'scale')
            if OUTPUT_IS_REGRESSION:
                # Remove TARGET_COLUMNS from columns_to_transform list
                columns_to_transform = list(set(columns_to_transform) - set(TARGET_COLUMNS))
            test_df = kaggle_data_transform(test_df, columns_to_transform, p_transform = 'scale')
            print('Main: training dataset after features transformations')
            print(train_df.head())
            print('Main: validation dataset after features transformations')
            print(validation_df.head())
            print('Main: test dataset after features transformations')
            print(test_df.head())
            # Do a basic DL evaluation as reference for the end
            # Takes too much time - y_basic_predictions = kaggle_ml_quick_and_dirty(train_df, validation_df, test_df)

    train_df, dropped_features = kaggle_features_selection(train_df)
    #dropped_features = []
    if len(dropped_features) != 0:
        if set(dropped_features).issubset(set(validation_df.columns)):
            validation_df.drop(dropped_features, inplace = True, axis = 1)
        if set(dropped_features).issubset(set(test_df.columns)):
            test_df.drop(dropped_features, inplace = True, axis = 1)
        print('Main: training dataset after features selection')
        print(train_df.head())
        print('Main: validation dataset after features selection')
        print(validation_df.head())
        print('Main: test dataset after features selection')
        print(test_df.head())

    # Build training & validation datasets
    ml_inputs_training_df, ml_outputs_training_df = kaggle_split_dataset(train_df)
    ml_inputs_validation_df, ml_outputs_validation_df = kaggle_split_dataset(validation_df)
    ml_inputs_test_df, _ = kaggle_split_dataset(test_df)
    print('Main: ml_inputs_training_df')
    print(ml_inputs_training_df.head())
    print('Main: ml_outputs_training_df')
    print(ml_outputs_training_df.head())
    print('Main: ml_inputs_validation_df')
    print(ml_inputs_validation_df.head())
    print('Main: ml_outputs_validation_df')
    print(ml_outputs_validation_df.head())
    print('Main: ml_inputs_test_df')
    print(ml_inputs_test_df.head())

    # Checking models
    models = []
    scoring = None
    if OUTPUT_IS_REGRESSION: # Use regression algorithms
        scoring = 'r2' # 'r2' or 'neg_mean_absolute_error'
        # Takes too much time - models.append(('LR', linear_model.LinearRegression()))
        # Takes too much time - models.append(('SGDC', linear_model.SGDRegressor(random_state = SEED_HARCODED_VALUE)))
        # Takes too much time - models.append(('LASSO', linear_model.Lasso()))
        # Takes too much time - models.append(('EN', linear_model.ElasticNet()))
        # Takes too much time - models.append(('KNN', neighbors.KNeighborsRegressor(n_neighbors = 8)))
        # Takes too much time - models.append(('CART', tree.DecisionTreeRegressor(max_leaf_nodes = 256, random_state = SEED_HARCODED_VALUE)))
        models.append(('LGBMR', lgb.LGBMRegressor(n_estimators = 1024, num_leaves = 128, max_depth = 32, learning_rate=0.05, random_state = SEED_HARCODED_VALUE)))
        models.append(('XGB', xgb.XGBRegressor(n_estimators = 1024, learning_rate=0.5, random_state = SEED_HARCODED_VALUE)))
        # Takes too much time - models.append(('BGK', ensemble.GradientBoostingRegressor(n_estimators = 256, max_depth = 16, random_state = SEED_HARCODED_VALUE)))
        # Takes too much time - models.append(('RF', ensemble.RandomForestRegressor(n_estimators = 1024, max_depth = 32, max_features = 4, random_state = SEED_HARCODED_VALUE)))
        # Takes too much time - models.append(('SVR', svm.SVR()))
    else: # Use classifier algorithms
        scoring = 'accuracy'
        models.append(('LR', linear_model.LogisticRegression(random_state = SEED_HARCODED_VALUE)))
        models.append(('SGDC', linear_model.SGDClassifier(random_state = SEED_HARCODED_VALUE)))
        models.append(('LDA', discriminant_analysis.LinearDiscriminantAnalysis()))
        models.append(('KNN', neighbors.KNeighborsClassifier(n_neighbors = 8)))
        models.append(('CART', tree.DecisionTreeClassifier(max_leaf_nodes = 256, random_state = SEED_HARCODED_VALUE)))
        models.append(('LGBMC', lgb.LGBMClassifier(n_estimators = 1024, num_leaves = 128, max_depth = 8, learning_rate=0.05, random_state = SEED_HARCODED_VALUE)))
        models.append(('XGB', xgb.XGBClassifier(n_estimators = 1024, learning_rate=0.5, random_state = SEED_HARCODED_VALUE)))
        models.append(('BGK', ensemble.GradientBoostingClassifier(n_estimators = 256, max_depth = 32, random_state = SEED_HARCODED_VALUE)))
        models.append(('RF', ensemble.RandomForestClassifier(n_estimators = 1024, max_depth = 32, max_features = 4, random_state = SEED_HARCODED_VALUE)))
        models.append(('NB', naive_bayes.GaussianNB()))
        models.append(('SVC', svm.SVC(random_state = SEED_HARCODED_VALUE)))

    results, names = kaggle_check_models(models, ml_inputs_training_df, ml_outputs_training_df, p_cross_validation = 'k-fold', p_scoring = scoring)
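    # Label the score axis with the metric actually used ('r2' for regression, 'accuracy' for classification)
    best_alg = kaggle_compare_algorithms_perf(names, results, 'Algorithms Comparison', 'Algorithms', scoring)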
    # Takes too long - ml = kaggle_algorithm_tuning(models[best_alg], ml_inputs_training_df, ml_outputs_training_df, (ml_inputs_validation_df, ml_outputs_validation_df))
    ml = models[best_alg][1]
    ml.fit(ml_inputs_training_df, ml_outputs_training_df)

    y_predictions = kaggle_validation_prediction(ml, ml_inputs_validation_df, ml_outputs_validation_df)
    # Takes too long - kaggle_explore_ml(ml, ml_inputs_validation_df, y_predictions)

    y_predictions = kaggle_prediction(ml, ml_inputs_test_df)
    # Takes too long - kaggle_explore_ml(ml, ml_inputs_test_df, y_predictions)

    print('Main: Save the model')
    file_name = ml.__class__.__name__
    kaggle_save_model(ml, '/kaggle/working/', file_name)

    print('Main: Save Kaggle competition submission')
    kaggle_save_result(ml, 'id', y_predictions, '../input/tabular-playground-series-jan-2021/test.csv', '/kaggle/working/result.csv')

    # Takes too long - print('Main: Reload the model')
    # Takes too long - ml = kaggle_load_model('/kaggle/working/', file_name)
    #y_predictions = kaggle_validation_prediction(ml, ml_inputs_validation_df, ml_outputs_validation_df)
    #kaggle_explore_ml(ml, ml_inputs_validation_df, y_predictions)

    print('Main: End of processing')
    # End of function kaggle_main
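
For readers who want to try the model comparison step in isolation, here is a minimal sketch of a k-fold comparison similar in spirit to the one done in `kaggle_main`. It does not reproduce `kaggle_check_models` or `kaggle_compare_algorithms_perf`; the helper `compare_models`, the `SEED` value, the Iris data and the three models chosen here are assumptions made only for illustration.

```python
from sklearn import datasets, linear_model, neighbors, tree
from sklearn.model_selection import KFold, cross_val_score

SEED = 42  # Hypothetical seed, it only stands in for SEED_HARCODED_VALUE

def compare_models(models, X, y, scoring='accuracy', n_splits=5):
    """Cross-validate each (name, estimator) pair and return names, scores and the best index."""
    names, results = [], []
    # Same shuffled folds for every model, so the scores are comparable
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=SEED)
    for name, model in models:
        scores = cross_val_score(model, X, y, cv=kfold, scoring=scoring)
        names.append(name)
        results.append(scores)
        print('%s: %.3f (+/- %.3f)' % (name, scores.mean(), scores.std()))
    # The best model is the one with the highest mean score over the folds
    best_index = max(range(len(results)), key=lambda i: results[i].mean())
    return names, results, best_index

# Small classification problem (Iris), used here only as an example
X, y = datasets.load_iris(return_X_y=True)
models = [
    ('LR', linear_model.LogisticRegression(max_iter=1000, random_state=SEED)),
    ('KNN', neighbors.KNeighborsClassifier(n_neighbors=8)),
    ('CART', tree.DecisionTreeClassifier(random_state=SEED)),
]
names, results, best_index = compare_models(models, X, y)

# Refit the selected model on all the training data, as kaggle_main does
best_model = models[best_index][1]
best_model.fit(X, y)
print('Best model:', names[best_index])
```

For a regression problem, the same helper can be called with `scoring='r2'` and regressors instead of classifiers.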

Phew! Now we can run all the steps described above and get some results:

In [None]:
# Entry point
print("Starting at ", datetime.now().strftime("%d/%m/%Y %H:%M:%S"))
kaggle_main()
print("Ending at ", datetime.now().strftime("%d/%m/%Y %H:%M:%S"))
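
The save and reload steps called from `kaggle_main` can be sketched in a few lines. This is not the notebook's `kaggle_save_model` or `kaggle_load_model`; the helpers `save_model` and `load_model`, the `.pkl` extension, the use of `pickle` and the temporary directory are all assumptions made for this example.

```python
import os
import pickle
import tempfile

from sklearn import datasets, tree

def save_model(model, directory, file_name):
    """Serialize a fitted model to <directory>/<file_name>.pkl and return the path."""
    path = os.path.join(directory, file_name + '.pkl')
    with open(path, 'wb') as f:
        pickle.dump(model, f)
    return path

def load_model(directory, file_name):
    """Reload a model previously written by save_model."""
    path = os.path.join(directory, file_name + '.pkl')
    with open(path, 'rb') as f:
        return pickle.load(f)

# Fit a small model to have something to save (Iris is only an example)
X, y = datasets.load_iris(return_X_y=True)
ml = tree.DecisionTreeClassifier(random_state=42).fit(X, y)

work_dir = tempfile.mkdtemp()  # Hypothetical folder, it stands in for '/kaggle/working/'
file_name = ml.__class__.__name__  # Same naming idea as in kaggle_main
saved_path = save_model(ml, work_dir, file_name)

# The reloaded model should give exactly the same predictions
reloaded = load_model(work_dir, file_name)
same_predictions = bool((reloaded.predict(X) == ml.predict(X)).all())
print('Saved to', saved_path, '- identical predictions:', same_predictions)
```

Note that a pickled model should be reloaded with the same library versions used to save it, and only from files you trust.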

**If you liked this notebook, please upvote.
It gives me motivation to make new notebooks :)**