As a beginner, I made this notebook to present a generic approach to "play" with the concepts of machine learning and neural networks. I have also tried to provide some clean Python code. To sum up, it is a synthesis of my current knowledge (and sorry for my English).

This is a first version, which will be improved competition after competition.

The basic steps to define a 'Generic approach of Machine Learning' are:
1. Define the problem, that is, understand the data you have and define what the inputs (attributes) and the output (target) of your Machine Learning are
2. Summarize the dataset content in a statistical form
3. Prepare the dataset for your Machine Learning processing
4. Evaluate a set of algorithms based on your understanding of the data
5. Improve the results of your Machine Learning by refining the algorithms
6. Present the results of your Machine Learning
7. Deploy or save your Machine Learning

**NOTE: Please, feel free to correct and enhance this notebook ;)**

To define the problem, we first have to choose the subject we will work on. Point b.1 provides different datasets we can play with. For each dataset, a comment describes the problem to address. 
We will consider two different problems:
1. One about classification (the basic one is the Iris classification)
2. One about regression (Melbourne housing prices)

This is the part that cannot be generic. The generic behavior proposed here is driven by the set of parameters defined in point b.1.

Note: In point b.1, to switch to another problem, just comment out the current one and uncomment the problem to play with

Switching to Python code, the first step is to load all the required libraries (1.a) and to choose the problem to solve, let's say Iris classification.

In [None]:
from __future__ import division # Import floating-point division (1/4=0.25) instead of Euclidean division (1/4=0)

# 1. Prepare Problem

# a) Load libraries
import os, warnings, argparse, io, operator, requests

import numpy as np # Linear algebra
import matplotlib.pyplot as plt # Data visualization
import pandas as pd # Data processing, CSV file I/O (e.g. pd.read_csv)

from pandas_profiling import ProfileReport

import sklearn
from sklearn import model_selection
from sklearn import linear_model # Regression
from sklearn import discriminant_analysis
from sklearn import neighbors # Nearest neighbors algorithms (e.g. KNN)
from sklearn import naive_bayes
from sklearn import tree # Decision tree learning
from sklearn import svm # Support Vector Machines
from sklearn import ensemble # Support RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor, AdaBoostRegressor

import xgboost as xgb # Gradient Boosted Decision Trees algorithm

from sklearn.base import is_classifier, is_regressor # To check if the model is for regression or classification (does not work for Keras)

from sklearn.impute import SimpleImputer 

from sklearn.preprocessing import LabelEncoder # Labelling categorical feature from 0 to N_class - 1('object' type)
from sklearn.preprocessing import LabelBinarizer # Binary labelling of categorical feature
from sklearn.preprocessing import OneHotEncoder

from sklearn.preprocessing import StandardScaler # Data standardization (zero mean, unit variance)
from sklearn.preprocessing import MinMaxScaler # Data rescaling to a given range (default: [0, 1])
from sklearn.preprocessing import MaxAbsScaler # Data scaling by the maximum absolute value

from sklearn.pipeline import Pipeline

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

from sklearn.inspection import permutation_importance

import pickle # Use to save and load models

import eli5
from eli5.sklearn import PermutationImportance

# Neural Network
import tensorflow as tf
import keras
from keras.wrappers.scikit_learn import KerasClassifier, KerasRegressor

First of all, we have to define the problem:
1. Understand the data, see point b.1) below
2. Prepare the basics of your code such as loading the libraries and your data, see points a) and b) below

In point b.1, we have a set of parameters strongly linked to the problem to solve. These parameters are used to configure the execution of the 'Generic approach of Machine Learning':
- DATABASE_NAME: The URI of the dataset
- COLUMNS_LABEL: The column labels of the dataset. Default: None, which means that the labels are already present in the loaded dataset
- COLUMNS_TO_DROP: The useless columns to drop after loading the dataset
- POST_LOAD_PROCESSING: A lambda function to apply to the whole loaded dataset
- FEATURES_SELECTION: The list of the features used as ML inputs. Default: None, which means that all columns (except the output column) are considered as features
- TARGET_COLUMN: The output column
- OUTPUT_IS_REGRESSION: Indicates whether the ML is about regression (True) or classification (False)
- DATE_TIME_COLUMNS: The list of the date/time columns stored in a customized format such as a string format
- FEATURES_PRE_PROCESSING: A dictionary that attaches a lambda function to a column. The lambda function is applied to the column just after loading the dataset (point 1.c).
- NON_TRANSFORMABLE_COLUMNS: The list of columns which shall not be included in the transformation process (point 3.b)

TODO: Add features creation lambda code
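To illustrate how a FEATURES_PRE_PROCESSING dictionary is meant to be applied, here is a minimal sketch on a tiny hand-made DataFrame. The column names and the values are made up for the example; only the way each lambda is applied to its column mirrors the loading function of point 1.c.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data (not taken from a real dataset)
df = pd.DataFrame({'cabin': ['C85', np.nan, 'E46'], 'fare': [71.28, 7.92, 53.10]})

# One lambda per column: keep only the first character of the cabin (the deck)
features_pre_processing = {
    'cabin': lambda p_value: p_value[0:1] if not pd.isna(p_value) else np.nan
}

# Apply each lambda to the column it is attached to
for column, func in features_pre_processing.items():
    df[column] = df[column].apply(func)

print(df)
```

The columns that are not listed in the dictionary (here 'fare') are left untouched.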


In [None]:
# b) Helpers

# b.1) Define global parameters
# Regression

# To predict house price using the famous Melbourne housing dataset
DATABASE_NAME = 'https://raw.githubusercontent.com/nagoya-foundation/r-functions-performance/master/data/Melbourne_housing_FULL.csv'
POST_LOAD_PROCESSING = None
COLUMNS_LABEL = None
COLUMNS_TO_DROP = ['Address', 'Method', 'Postcode', 'CouncilArea', 'Propertycount', 'Regionname', 'SellerG']
FEATURES_SELECTION = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 'YearBuilt', 'Lattitude', 'Longtitude']
TARGET_COLUMN = 'Price'
OUTPUT_IS_REGRESSION=True
DATE_TIME_COLUMNS = ['Date']
FEATURES_PRE_PROCESSING = None
NON_TRANSFORMABLE_COLUMNS = None
# Suburb
# Address
# Rooms
# Type
# Price
# Method
# SellerG
# Date
# Distance
# Postcode
# Bedroom2
# Bathroom
# Car
# Landsize
# BuildingArea
# YearBuilt
# CouncilArea
# Lattitude
# Longtitude
# Regionname
# Propertycount

# Classification
# To categorize an iris flower according to the dimensions of its sepals & petals 
# Famous database; from Fisher, 1936
#DATABASE_NAME = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
#POST_LOAD_PROCESSING = None
#COLUMNS_TO_DROP = None
#FEATURES_SELECTION = None
#TARGET_COLUMN = 'class'
#OUTPUT_IS_REGRESSION=False
#DATE_TIME_COLUMNS = None
#FEATURES_PRE_PROCESSING = None
#NON_TRANSFORMABLE_COLUMNS = None
#COLUMNS_LABEL = ['sepal length in cm', 'sepal width in cm', 'petal length in cm', 'petal width in cm', 'class']

# To predict survival on the Titanic
# This is the list of the passengers
#DATABASE_NAME = 'https://raw.githubusercontent.com/alexisperrier/packt-aml/master/ch4/titanic.csv'
#POST_LOAD_PROCESSING = lambda x: pd.concat([pd.DataFrame({'passengerid': np.arange(x.shape[0])}), x], axis = 1)
#COLUMNS_LABEL = None
#COLUMNS_TO_DROP = ['name', 'ticket', 'embarked', 'boat', 'body', 'home.dest']
    # We assume that the dropped columns (name, ticket...) are not relevant information
    # This can be confirmed by the correlation matrix
#FEATURES_SELECTION = None
#TARGET_COLUMN = 'survived'
#OUTPUT_IS_REGRESSION=False
#DATE_TIME_COLUMNS = None
#FEATURES_PRE_PROCESSING = { 'cabin': lambda p_value : p_value[0:1] if not pd.isna(p_value) else np.nan }
#NON_TRANSFORMABLE_COLUMNS = ['passengerid']
#  PassengerId: Unique passenger id
#  Survived: Survival status ('Yes' or 'No')
#  Pclass: The class the passenger belongs to (1st, 2nd or 3rd class)
#  Name: Name of the passenger
#  Sex: The sex of the passenger ('male' or 'female')
#  Age: The age of the passenger (in years)
#  SibSp: # of siblings / spouses aboard the Titanic
#  Parch: # of parents / children aboard the Titanic
#  Ticket: No description available for this field, perhaps the travel company identifier
#  Fare: Ticket price
#  Cabin: Identifier of the cabin. The first character identifies the deck.
#         This could be interesting for the ML, creating a new feature Deck
#  Embarked: Port of Embarkation

# This dataset describes the medical records for Pima Indians and whether or not each patient will have an onset of diabetes within five years.
# NOTE: Disable flag DATA_CLEANING_FLAG, this dataset is already ready to be used by ML 
#DATABASE_NAME = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'
#COLUMNS_TO_DROP = None
#FEATURES_SELECTION = None
#TARGET_COLUMN = 'class'
#OUTPUT_IS_REGRESSION=False
#DATE_TIME_COLUMNS = None
#FEATURES_PRE_PROCESSING = None
#POST_LOAD_PROCESSING = None
#NON_TRANSFORMABLE_COLUMNS = None
#COLUMNS_LABEL = ['preg', 'plas', 'pres (mm Hg)', 'skin (mm)', 'test (mu U/ml)', 'mass', 'pedi', 'age (years)', 'class']
#  preg = Number of times pregnant
#  plas = Plasma glucose concentration at 2 hours in an oral glucose tolerance test
#  pres = Diastolic blood pressure
#  skin = Triceps skin fold thickness (mm)
#  test = 2-Hour serum insulin (mu U/ml)
#  mass = Body mass index (weight in kg/(height in m)^2)
#  pedi = Diabetes pedigree function
#  age = Age (years)
#  class = Class variable (1:tested positive for diabetes, 0: tested negative for diabetes)

# This dataset collects information from 100k medical appointments in Brazil and is focused on the question of whether or not patients show up for their appointment. A number of characteristics about the patient are included in each row
#DATABASE_NAME = 'https://github.com/jbrownlee/Datasets/blob/master/pima-indians-diabetes.data.csv'
#COLUMNS_TO_DROP = None
#FEATURES_SELECTION = None
#TARGET_COLUMN = 'No-show'
#OUTPUT_IS_REGRESSION=False
#DATE_TIME_COLUMNS = None
#FEATURES_PRE_PROCESSING = None
#COLUMNS_LABEL = None


Before loading and examining our dataset, we are just going to set a number of defaults such as the settings for the plotting operations, the Deep Learning parameters... (point b.2)

Note: Point b.3 shall be used if you want to reassemble the notebook code and create a standalone Python script

In [None]:
# b.2) Set some defaults
def set_mp_default() -> None:
    """
    Some default setting for Matplotlib plots
    """
    plt.rc('figure', autolayout=True)
    plt.rc('axes', labelweight='bold', labelsize='large', titleweight='bold', titlesize=18, titlepad=10)
    plt.rc('image', cmap='magma')
    warnings.filterwarnings("ignore") # to clean up output cells
    pd.set_option('display.precision', 3) # Use the full option name: the short 'precision' is ambiguous in recent pandas versions
    # End of function set_mp_default

# Basic Deep Learning parameters
DL_BATCH_SIZE = 32
DL_EPOCH_NUM = 128
DL_DROP_RATE = 0.3

# Fix random values for reproducibility
SEED_HARCODED_VALUE = 666

def set_seed(p_seed: int = SEED_HARCODED_VALUE) -> None:
    """
    Random reproducibility
    :parameter p_seed: Set the seed value for random functions
    """
    np.random.seed(p_seed)
    sklearn.utils.check_random_state(p_seed)
    #tf.random.set_seed(p_seed)
    os.environ['PYTHONHASHSEED'] = str(p_seed)
    os.environ['TF_DETERMINISTIC_OPS'] = '1'
    # End of function set_seed

def modules_versions() -> None:
    """
    Print the different modules version
    """
    print('----------------------------- modules_versions -----------------------------')
    print("Numpy version: " + np.__version__)
    print("Pandas version: " + pd.__version__)
    print("sklearn version: " + sklearn.__version__)
    print("Tensorflow version: " + tf.__version__)
    print('modules_versions: Done')
    # End of function modules_versions
    
def kaggle_tpu_detection():
    """
    TPU detection
    :return: The appropriate distribution strategy
    """
    print('----------------------------- kaggle_tpu_detection -----------------------------')
    try:
        tpu = tf.distribute.cluster_resolver.TPUClusterResolver() 
        print('kaggle_tpu_detection: Running on TPU ', tpu.master())
    except ValueError:
        tpu = None
    if tpu:
        tf.config.experimental_connect_to_cluster(tpu)
        tf.tpu.experimental.initialize_tpu_system(tpu)
        strategy = tf.distribute.experimental.TPUStrategy(tpu)
    else:
        strategy = tf.distribute.get_strategy() 
    print('kaggle_tpu_detection: %s' % str(strategy.num_replicas_in_sync))
    print('kaggle_tpu_detection: %s' % str(type(strategy)))
    print('kaggle_tpu_detection Done')
    return strategy
    # End of function kaggle_tpu_detection

# b.3) Set execution control flags
from enum import IntFlag

class ExecutionFlags(IntFlag):
    """
    This class provides some execution control flags to enable/disable some parts of the whole script execution
    """
    NONE                     = 0b00000000 # All flags disabled
    ALL                      = 0b11111111 # All flags enabled
    DATA_STAT_SUMMURIZE_FLAG = 0b00000001 # Enable statistical analysis
    DATA_VISUALIZATION_FLAG  = 0b00000010 # Enable data visualization
    DATA_CLEANING_FLAG       = 0b00000100 # Enable data cleaning (feature engineering, data transform)
    USE_NEURAL_NETWORK_FLAG  = 0b00001000 # Enable the use of neural networks
    # End of class ExecutionFlags

Now, we are ready to load our dataset and examine it to understand the data it contains. The kaggle_load_dataset() function accepts a URI starting with file:// or http (e.g. file:///... or http://... or https://...).
When loading the dataset, you can specify or overwrite the column labels. 
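When a CSV file has no header row (as is the case for the Iris file), the way the labels are supplied matters. The sketch below is only an illustration under that assumption: read_csv_text is a hypothetical helper, not part of this notebook, and it reads from an in-memory string so that it can run without network access. Passing header=None together with names keeps the first data row instead of consuming it as a header.

```python
import io
import pandas as pd

def read_csv_text(p_text, p_labels=None):
    """
    Hypothetical helper: parse CSV text, with or without a header row
    :parameter p_text: The CSV content as a string
    :parameter p_labels: The column labels when the content has no header row. Default: None
    :return: The loaded DataFrame
    """
    if p_labels is None:
        # The first row of the content is the header
        return pd.read_csv(io.StringIO(p_text))
    # No header row in the content: every row is data
    return pd.read_csv(io.StringIO(p_text), header=None, names=p_labels)

# Two made-up rows in the same layout as the Iris file (no header row)
iris_like = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"
labels = ['sepal length in cm', 'sepal width in cm', 'petal length in cm', 'petal width in cm', 'class']
df = read_csv_text(iris_like, labels)
print(df.shape)
```

With this approach both rows are kept as observations, which is worth checking against the shape printed after loading.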

In [None]:
# c) Load dataset
def kaggle_load_dataset(p_url: str, p_labels: list = None) -> pd.core.frame.DataFrame:
    """
    This function loads the dataset specified by p_url and adds the labels if required
    :parameters p_url: The URI of the dataset (http:// or file://)
    :parameters p_labels: The label of the columns to be used. Default: None
    :return: The dataset handle
    :exception: Raised if specified link is not correct
    """
    print('----------------------------- kaggle_load_dataset -----------------------------')

    df = None
    if p_url.startswith('file://'):
        df = pd.read_csv(p_url[7:])
    elif p_url.startswith('http'):
        ds = requests.get(p_url).content
        df = pd.read_csv(io.StringIO(ds.decode('utf-8')))
    if df is None:
        raise Exception('kaggle_load_dataset: Failed to load data frame', 'url=%s' % (p_url))
    if p_labels is not None:
        df.columns = p_labels
    # Apply post processing after loading dataset
    if POST_LOAD_PROCESSING is not None:
        df = POST_LOAD_PROCESSING(df)
    # Drop columns if any
    if COLUMNS_TO_DROP is not None:
        df.drop(COLUMNS_TO_DROP, inplace = True, axis = 1)
    if FEATURES_PRE_PROCESSING is not None and isinstance(FEATURES_PRE_PROCESSING, dict):
        for key in FEATURES_PRE_PROCESSING.keys():
            df[key] = df[key].apply(FEATURES_PRE_PROCESSING[key])
    print('kaggle_load_dataset: Done: %s - %s' % (p_url, df.shape))
    return df
    # End of function kaggle_load_dataset

### Examining the dataset means getting a global overview of its data from a statistical point of view, using:
1. Some basic statistical tools such as means, standard deviations, quartiles and correlations (2.a)
2. Some visualization tools such as histograms and density plots (2.b)

Understanding the data is the most important step. The kaggle_summurize_data() function provides a lot of information to help you in this task:
- Dataset info: It provides information about the structure of the data:
1) The number of features (or attributes or columns), and the name (or label) of each. Here, it is important to understand what each feature means, which values it can take, which units it uses... This requires a lot of research work to understand our problem,
2) The type of each feature. The 'object' type usually indicates a categorical feature, which means we will have to transform it into numerical values (see point 3.a),
3) One or several of these features will be our ML output, and some of them could be removed later because they bring little to solving our problem (e.g. highly correlated features, feature reduction using PCA...),
4) The number of observations (or samples) in the dataset. This will be useful to split our dataset into training, validation and test datasets.
- Dataset columns labels: It indicates the name (or label) of each attribute
- Means: It provides the mean value of each feature (also provided by the statistical abstract, see below)
- Dataset statistical abstract: It provides, for each feature, basic statistical metrics such as mean, standard deviation, quartiles...
- Dataset Head: It displays the first samples of the dataset. It gives you some indication of the values of each observation. Note that it is not sufficient to detect specific values such as NULL or NaN values, zeros, string values, categorical values... 
- Unique possible columns: It provides, for each feature, the list of its unique values. This will help you during the data transformation to rescale and center the feature values (see point 3.b). Very often, a feature with few unique values (e.g. 2 or 3) also indicates a categorical feature,
- Correlation table: It provides the correlation between each pair of features and the list of the pairs whose correlation is above 0.7 or below -0.7. This will be used to reduce the number of features when some of them are strongly linked (see the p_correlation_threshold parameter)

Note: Here we use pandas_profiling to generate an analysis report in HTML format. This report is highly valuable because of the information it provides for each column:
1. Specific value indicators such as zeros, NaN...
2. Distinct values
3. Statistical values such as mean, min/max...
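The search for strongly correlated pairs described above can be sketched in isolation as follows. This is an illustration on made-up data, not the notebook's own code: strong_correlations is a hypothetical helper, and it reports each pair only once (upper triangle of the correlation matrix) rather than in both directions.

```python
import pandas as pd

def strong_correlations(p_df, p_threshold=0.7):
    """
    Hypothetical helper: list the pairs of numerical columns whose
    absolute Pearson correlation is above p_threshold
    """
    # Correlation is only defined for numerical columns
    corr = p_df.select_dtypes(include='number').corr()
    cols = corr.columns
    rows = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # Upper triangle: each pair once
            value = corr.iloc[i, j]
            if abs(value) > p_threshold:
                rows.append((cols[i], cols[j], value))
    return pd.DataFrame(rows, columns=['var1', 'var2', 'corr'])

# Made-up data: 'b' is almost proportional to 'a', 'c' is unrelated
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10.5],
    'c': [5, 1, 4, 2, 3],
    'label': ['x', 'y', 'x', 'y', 'x'],  # Non-numerical column, ignored
})
result = strong_correlations(df)
print(result)
```

When a pair shows up here, one of the two features is a candidate for removal, since it carries almost the same information as the other.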

In [None]:
# 2. Summarize the dataset content in a statistical form
# a) Descriptive statistics
def kaggle_summurize_data(p_df: pd.core.frame.DataFrame, p_correlation_threshold: float = 0.7) -> None:
    """
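    This function provides a statistical view of the current dataset
    :parameters p_df: The dataset handle
    :parameters p_correlation_threshold: The absolute correlation value above which a pair of features is reported. Default: 0.7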
    """
    print('----------------------------- kaggle_summurize_data -----------------------------')
    # General information
    print('Dataset info:')
    p_df.info() # info() prints by itself and returns None
    print('----------------------------- kaggle_summurize_data: Dataset columns labels:')
    print(p_df.columns)
    print('----------------------------- kaggle_summurize_data: Means:')
    print(p_df.mean(numeric_only = True)) # Ignore non-numerical columns
    print('----------------------------- kaggle_summurize_data: Dataset statistical abstract:')
    print(p_df.describe().T)
    print('----------------------------- kaggle_summurize_data: Dataset Head:')
    print(p_df.head(20))
    # NaN values
    print('----------------------------- kaggle_summurize_data: NaN values: %d - %f' % (p_df.isnull().sum().sum(), 100 * p_df.isnull().sum().sum() / np.prod(p_df.shape)))
    print('----------------------------- kaggle_summurize_data: NaN values distribution:')
    print(p_df.isnull().sum().sort_values(ascending = False))
    # Zeros per columns
    print('----------------------------- kaggle_summurize_data: Zeros per columns:')
    for column in p_df.columns:
        if p_df[column].dtype == 'int64' or p_df[column].dtype == 'float64':
            zeros = p_df[column].isin([0]).sum()
            # Number of zeros and percentage of zeros among the observations
            print('{}: {} ({:.2f}%)'.format(column, zeros, 100 * zeros / len(p_df)))
        else:
            print('%s: Not numerical column' % column)
    #  Unique possible columns
    print('----------------------------- kaggle_summurize_data: Unique possible columns:')
    for column in p_df.columns:
        print('{}: {}'.format(column, p_df[column].unique()))
    # Build Correlation matrix
    print('----------------------------- kaggle_summurize_data: Correlation table:')
    print(p_df.select_dtypes(include = 'number').corr(method = 'pearson')) # Correlation is only defined for numerical columns
    # Extract the correlations above the threshold (in absolute value)
    print('----------------------------- kaggle_summurize_data: Correlations in range > %f and < -%f:' % (p_correlation_threshold, p_correlation_threshold))
    corr = p_df.select_dtypes(include = 'number').corr().unstack().reset_index() # Group together pairwise
    corr.columns = ['var1', 'var2', 'corr'] # Rename columns to something readable
    print(corr[ (corr['corr'].abs() > p_correlation_threshold) & (corr['var1'] != corr['var2']) ] )
    # Finally, create Pandas Profiling
    print('----------------------------- kaggle_summurize_data: Pandas Profiling:')
    file = ProfileReport(p_df)
    #file.to_file('./eda.html')
    file.to_notebook_iframe()
    print('kaggle_summurize_data: Done')
    # End of function kaggle_summurize_data

The kaggle_visualization() function provides different plots to explore the data distribution (Gaussian, exponential...) and to detect outlier values. It will help 1) during the data cleaning and 2) later, to choose the ML algorithms (e.g. tree-based algorithms are generally less sensitive to outliers).
There are two kinds of data visualization:
- The Univariate Plots which are related to each feature, and
- The Multivariate Plots which are related to the interactions between features

The Univariate Plots:
- Histograms: They provide a graphical representation of the distribution of a dataset. For a continuous numerical feature, a histogram shows the underlying frequency distribution or the probability distribution of the signal (see https://towardsdatascience.com/histograms-why-how-431a5cfbfcd5)
- Density: It is the continuous form of the histogram (see above) and it shows an estimate of the continuous distribution of a feature (Gaussian distribution, exponential distribution...)

The Multivariate Plots:
- Correlation: It indicates how two features change together
- scatter_matrix: It plots each feature against every other one, to show the relationship between them
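The plots help to spot outliers visually. As a numerical cross-check, one common convention (an assumption of this sketch, not something this notebook implements) is the interquartile range rule: a value is flagged when it lies more than 1.5 times the IQR below the first quartile or above the third quartile. The helper name and the sample values are made up.

```python
import pandas as pd

def iqr_outliers(p_series, p_factor=1.5):
    """
    Hypothetical helper: return the values lying outside
    [Q1 - p_factor * IQR, Q3 + p_factor * IQR]
    """
    q1 = p_series.quantile(0.25)
    q3 = p_series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - p_factor * iqr
    upper = q3 + p_factor * iqr
    return p_series[(p_series < lower) | (p_series > upper)]

# Made-up values with one obvious outlier
values = pd.Series([10, 12, 11, 13, 12, 11, 95])
outliers = iqr_outliers(values)
print(outliers)
```

The factor 1.5 is only a convention; a larger factor flags fewer values.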

In [None]:
# b) Data visualizations
def kaggle_visualization(p_df: pd.core.frame.DataFrame) -> None:
    """
    This method provides different views of the dataset (plot)
    :parameters p_df: The dataset handle
    """
    print('----------------------------- kaggle_visualization -----------------------------')
    # Histogram plot
    print('kaggle_visualization: Histograms')
    p_df.hist()
    #plt.savefig('kaggle_histo.png')
    plt.show()
    # Density plot
    print('kaggle_visualization: Density')
    p_df.plot(kind='density', subplots = True, layout = (4, 4))
    #plt.savefig('density.png')
    plt.show()
    # Correlation plots
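    print('kaggle_visualization: Correlation')
    correlations = p_df.select_dtypes(include = 'number').corr() # Correlation is only defined for numerical columns
    fig = plt.figure()
    ax = fig.add_subplot(111)
    cax = ax.matshow(correlations, vmin = -1, vmax = 1, interpolation = 'none')
    fig.colorbar(cax)
    # Use the columns of the correlation matrix so that ticks and labels always match
    ticks = np.arange(0, len(correlations.columns), 1)
    ax.set_xticks(ticks)
    ax.set_yticks(ticks)
    ax.set_xticklabels(correlations.columns)
    ax.set_yticklabels(correlations.columns)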
    #plt.savefig('correlations_matrix.png')
    plt.show()
    # Pandas scatter_matrix plot
    print('kaggle_visualization: scatter_matrix')
    from pandas.plotting import scatter_matrix
    scatter_matrix(p_df)
    #plt.savefig('scatter_matrix.png')
    plt.show()
    print('kaggle_visualization: Done')
    # End of function kaggle_visualization

The function kaggle_ml_quick_and_dirty() provides a 'quick and dirty' evaluation of a ML based on the RandomForest algorithm (RandomForestRegressor or RandomForestClassifier, depending on OUTPUT_IS_REGRESSION) with its default parameters.
FIXME: To be refined, does not work as expected :(

In [None]:
# c) Basic ML for a quick & dirty evaluation
def kaggle_ml_quick_and_dirty(p_df: pd.core.frame.DataFrame, p_validation_size: float = 0.20, p_seed:int = SEED_HARCODED_VALUE) -> np.ndarray:
    """
    This method provides a first ML evaluation based on the RandomForest algorithm
    :parameters p_df: The dataset handle
    :parameter p_validation_size: The fraction of the dataset kept for validation. Default: 20% of the dataset will be used for validation, 80% of the dataset will be used for training
    :parameter p_seed: The seed value for random functions
    :return: The predictions on the validation dataset
    """
    print('----------------------------- kaggle_ml_quick_and_dirty -----------------------------')
    p = p_df.copy()
    # Remove NaN values
    p.dropna(axis = 0, inplace = True)
    # Build training & validation datasets
    Y = p[TARGET_COLUMN] # Extract the target first: it may be categorical (e.g. Iris 'class')
    if FEATURES_SELECTION is None:
        X = p.drop([TARGET_COLUMN], axis = 1)
    else:
        X = p[FEATURES_SELECTION]
    # Ignore categorical features
    X = X.select_dtypes(exclude=['object'])
    print('----------------------------- kaggle_ml_quick_and_dirty: X')
    print(X.head())
    X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size = p_validation_size, random_state = p_seed)
    model = None
    # Use classical model
    if OUTPUT_IS_REGRESSION:
        model = ensemble.RandomForestRegressor(random_state = p_seed)
    else:
        model = ensemble.RandomForestClassifier(random_state = p_seed)
    # Train the model
    model.fit(X_train, Y_train)
    # Do predictions
    y_predictions = model.predict(X_validation)
    # Get scoring
    if OUTPUT_IS_REGRESSION:
        print('kaggle_ml_quick_and_dirty: Model R2 score: %0.4f' % r2_score(Y_validation, y_predictions))
        print('kaggle_ml_quick_and_dirty: Model Mean absolute error regression loss: %0.4f' % mean_absolute_error(Y_validation, y_predictions))
        print('kaggle_ml_quick_and_dirty: Mean squared error regression loss: %0.4f' % mean_squared_error(Y_validation, y_predictions))
    else:
        print('kaggle_ml_quick_and_dirty: Model accuracy score: %0.4f' % accuracy_score(Y_validation, y_predictions))
        print('kaggle_ml_quick_and_dirty: Confusion matrix: %s' % str(confusion_matrix(Y_validation, y_predictions)))
        print('kaggle_ml_quick_and_dirty: Classification report:\n%s' % str(classification_report(Y_validation, y_predictions)))
    print('kaggle_ml_quick_and_dirty: Done')
    return y_predictions
    # End of function kaggle_ml_quick_and_dirty

Now we need to take a break, in order:
1. to understand exactly what each column is
2. to learn from the results we got

For instance:
1. Let's take a look at the number of '0' values in each column and try to understand what they mean. In the PIMA diabetes dataset, what does a glucose or a blood pressure value of 0 mean? It's not possible!!! In this case, the solution is to replace '0' by a NaN value and the imputation process will do the job ;) The same applies to the maximal value per column
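As a minimal sketch of this idea, the example below replaces the impossible zeros by NaN and then imputes them with the median of the column. The rows are made up, the choice of the median strategy and of the columns to treat are assumptions of the example, and the 'preg' column is deliberately left alone because zero pregnancies is a legitimate value.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up rows using some of the PIMA column labels
df = pd.DataFrame({
    'plas': [148, 85, 0, 89],
    'pres (mm Hg)': [72, 0, 64, 66],
    'preg': [6, 1, 8, 0],
})

# Columns where a zero cannot be a real measurement (assumption of this example)
physiological_columns = ['plas', 'pres (mm Hg)']

# Step 1: mark the impossible zeros as missing values
df[physiological_columns] = df[physiological_columns].replace(0, np.nan)

# Step 2: impute the missing values with the median of each column
imputer = SimpleImputer(strategy='median')
df[physiological_columns] = imputer.fit_transform(df[physiological_columns])

print(df)
```

The median is computed on the remaining valid values only, so the former zeros no longer pull the statistics down.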

So, the next step is to prepare the data for ML. Usually, you get better results when all the columns (features and outputs) are in numerical format (int or float).

1. Feature engineering. It eliminates NULL or NaN values and duplicate values, and it transforms date/time columns and categorical columns into numerical features. It identifies & handles outliers... (3.a). Categorical columns are usually of type object and the objective here is to transform these categorical columns into numerical columns. Date/time columns can be either of type object (e.g. date/time in string format) or of type datetime64[ns]. For specific features such as 'Age', it is possible to create a new feature grouping the Age values per range, from the lowest Age value to the highest one
2. Data transformation. It applies some numerical transformation such as standardization of features... (3.b)
3. Features selection. It selects and prepares the dataset for the training and the validation (3.c)
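The feature engineering ideas above can be sketched with plain pandas on made-up data. This is an illustration only: encode_categoricals is a hypothetical helper, it uses pd.get_dummies and category codes as stand-ins for OneHotEncoder and LabelEncoder, and the age ranges are arbitrary. Low-cardinality columns are one-hot encoded, the others are replaced by integer codes, and a new feature groups the ages per range.

```python
import pandas as pd

def encode_categoricals(p_df, p_onehot_threshold=10):
    """
    Hypothetical helper: one-hot encode the categorical columns with few
    unique values, replace the other ones by integer codes
    """
    p = p_df.copy()
    categorical_columns = [col for col in p.columns if not pd.api.types.is_numeric_dtype(p[col])]
    for col in categorical_columns:
        if p[col].nunique() <= p_onehot_threshold:
            p = pd.get_dummies(p, columns=[col])  # One column per category
        else:
            p[col] = p[col].astype('category').cat.codes  # 0 to N_class - 1
    return p

# Made-up data
df = pd.DataFrame({
    'sex': ['male', 'female', 'female', 'male'],
    'ticket': ['A1', 'B2', 'C3', 'D4'],
    'age': [22, 38, 4, 61],
})
out = encode_categoricals(df, p_onehot_threshold=3)

# New feature grouping the ages per range (arbitrary bounds)
out['age_range'] = pd.cut(out['age'], bins=[0, 18, 60, 120], labels=['child', 'adult', 'senior'])

print(out)
```

One-hot encoding avoids suggesting an order between categories, but it adds one column per category, which is why a cardinality threshold is useful.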

In [None]:
# 3. Prepare the dataset for your Machine Learning processing
# a) Data Cleaning
Encoders = dict()
Encoder_Instance = LabelEncoder() # Use global variable for future reverse features engineering
Imputer_Instance = None
def kaggle_features_engineering(p_df: pd.core.frame.DataFrame, p_missing_value_method: str = 'drop_columns', p_duplicated_value_method: str = 'drop_columns', p_categorical_onehot_threshold: int = 10, p_date_time_columns: list = None, p_date_time_engineering: str = 'python_time') -> pd.core.frame.DataFrame:
    """
    This function performs a cleaning of the dataset to remove null values, duplicate values, based on the specified method
    :parameters p_df: The dataset handle
    :parameters p_missing_value_method: The method to cleanup NaN values in the dataset ('drop_columns', 'drop_lines', 'mean', 'median'). Default: 'drop_columns'
    :parameters p_duplicated_value_method: The method to cleanup duplicates in the dataset ('drop_columns', 'drop_lines', 'mean', 'median'). Default: 'drop_columns'
    :parameters p_categorical_onehot_threshold: The maximum cardinality to apply OneHotEncoder to a categorical variable. Default: 10
    :parameters p_date_time_columns: The list of the date/time columns to convert. Default: None
    :parameters p_date_time_engineering: The method to convert Date/Time. Default: 'python_time'
    :return: The dataset after the cleanup process
    """
    global Encoders, Encoder_Instance, Imputer_Instance
    
    print('----------------------------- kaggle_features_engineering -----------------------------')
    # Cleanup dataset
    old_shape = p_df.shape
    p = p_df.copy() # The final dataset

    # Convert Date/time columns
    # dtype = 'datetime64[ns]'
    print('----------------------------- kaggle_features_engineering: Processing Date/Time columns')
    if p_date_time_columns is not None: # Process specified columns
        # Check date/time formats
        for column in p_date_time_columns: # TODO Check if all DateTime values have the same format, i.e. same length
            date_lengths = p[column].str.len().value_counts()
            print('kaggle_features_engineering: %s lengths:' % column)
            print('%s - %d' % (str(date_lengths), len(date_lengths)))
            # End of 'for' statement
        p[p_date_time_columns] = p[p_date_time_columns].astype('datetime64[ns]')
        p[p_date_time_columns] = p[p_date_time_columns].astype('int64')
        print('kaggle_features_engineering: Date/time columns processed')
    else:
        print('kaggle_features_engineering: No date/time values')
    # Be sure there is no more 'datetime64[ns]' types in the dataset
    datetime_columns = [col for col in p.columns if p[col].dtype == 'datetime64[ns]']
    if len(datetime_columns) != 0:
        raise Exception('kaggle_features_engineering: There are still datetime64[ns] columns in the dataset', 'columns=%s' % str(datetime_columns))

    # Find N/A values for categorical columns and replace them by the most frequent value
    print('----------------------------- kaggle_features_engineering: Processing NaN values')
    categorical_columns_with_nan = [col for col in p.columns if p[col].dtype == 'object' and p[col].isna().sum() != 0]
    if len(categorical_columns_with_nan) != 0:
        print('----------------------------- kaggle_features_engineering: Impute NaN values for categorical columns with the most frequent value')
        for col in categorical_columns_with_nan:
            p[col] = p[col].fillna(p[col].value_counts().idxmax()) # Assign back instead of a chained inplace fillna
            # End of 'for'statement
        # Check that there are no more categorical columns with NaN
        categorical_columns_with_nan = [col for col in p.columns if p[col].dtype == 'object' and p[col].isna().sum() != 0]
        if len(categorical_columns_with_nan) != 0:
            raise Exception('kaggle_features_engineering: There are still categorical columns with NaN values in the dataset', 'columns=%s' % str(categorical_columns_with_nan))
    else:
        print('----------------------------- kaggle_features_engineering: No NaN value in categorical columns')
    # Use Imputation to replace NaN in numerical columns
    print('----------------------------- kaggle_features_engineering: Impute NaN values for numerical columns with %s method' % p_missing_value_method)
    numerical_columns_with_nan = [col for col in p.columns if (p[col].dtype == 'int64' or p[col].dtype == 'float64') and p[col].isna().sum() != 0]
    if len(numerical_columns_with_nan) != 0:
        print('kaggle_features_engineering: cols_with_missing: %s' % (str(numerical_columns_with_nan)))
        # Find rows with missing values
        rows_with_null = p[numerical_columns_with_nan].isnull().any(axis=1)
        rows_with_missing = p[rows_with_null]
        print('kaggle_features_engineering: rows_with_missing: %s/%s' % (rows_with_missing.shape[0], p.shape[0]))
        # Impute missing values
        if p_missing_value_method == 'drop_columns': # Remove the columns with missing values
            p = p.drop(numerical_columns_with_nan, axis = 1)
        elif p_missing_value_method == 'drop_lines': # Remove the rows with missing values
            p = p.dropna()
        else: # Impute using SimpleImputer
            if p_missing_value_method == 'mean':
                Imputer_Instance = SimpleImputer(strategy='mean')
            elif p_missing_value_method == 'median':
                Imputer_Instance = SimpleImputer(strategy='median')
            else:
                raise Exception('kaggle_features_engineering: Invalid method', 'method=%s' % (p_missing_value_method))
            # fit_transform() returns a NumPy array: assign it directly so that the values are not realigned on the index
            p[numerical_columns_with_nan] = Imputer_Instance.fit_transform(p[numerical_columns_with_nan])
            print('kaggle_features_engineering: Cleaning NaN values: old_shape: %s / new shape: %s' % (str(old_shape), str(p.shape)))
    else:
        print('kaggle_features_engineering: No missing values in numerical columns')
    print('----------------------------- kaggle_features_engineering: After First round:')
    print(p.head())
    print(p.describe().T)

    # Search for categorical variables
    print('----------------------------- kaggle_features_engineering: Encoding categorical columns:')
    categorical_columns = [col for col in p.columns if p[col].dtype == 'object']
    new_categorical_columns = [] # Empty list (not None) so that callers can always iterate over the result
    if len(categorical_columns) != 0:
        print('kaggle_features_engineering: categorical_columns: ' + str(categorical_columns))
        # Compute cardinalities of the categorical variables
        categorical_columns_cardinalities = list(map(lambda col: p[col].nunique(), categorical_columns))
        print('kaggle_features_engineering: categorical_columns_cardinalities: ')
        print(categorical_columns_cardinalities)
        print('kaggle_features_engineering: OneHotEncoder thresholds: %d' % p_categorical_onehot_threshold)
        # Apply OneHot encoding to categorical value with very low cardinality
        cols_processed = []
        new_categorical_columns = categorical_columns.copy()
        for i in range(0, len(categorical_columns)):
            if categorical_columns_cardinalities[i] == 2: # Use LabelBinarizer
                print('kaggle_features_engineering: LabelBinarizer: %s' % categorical_columns[i])
                Encoders[categorical_columns[i]] = LabelBinarizer()
                # For two classes, fit_transform() returns a single column: flatten it before the assignment
                p[categorical_columns[i]] = Encoders[categorical_columns[i]].fit_transform(p[categorical_columns[i]]).ravel()
                new_categorical_columns.remove(categorical_columns[i])
            elif categorical_columns_cardinalities[i] <= p_categorical_onehot_threshold:
                print('kaggle_features_engineering: OneHotEncoder: %s' % categorical_columns[i])
                Encoders[categorical_columns[i]] = OneHotEncoder(handle_unknown = 'ignore', sparse = False)
                new_col = Encoders[categorical_columns[i]].fit_transform(pd.DataFrame(p[categorical_columns[i]]))
                new_col = pd.DataFrame(new_col, columns = [(categorical_columns[i] + "_" + str(j)) for j in range(new_col.shape[1])])
                new_col.index = p[categorical_columns[i]].index
                p = p.drop(categorical_columns[i], axis = 1).join(new_col)
                cols_processed.append(categorical_columns[i])
                # Update the list of the categorical columns
                new_categorical_columns.remove(categorical_columns[i])
                new_categorical_columns.extend(new_col.columns.tolist())
            else:
                # Just drop them for the time being
                # FIXME To be refined using TargetEncoder
                p.drop(categorical_columns[i], axis = 1, inplace = True)
                # Update the list of the categorical columns
                new_categorical_columns.remove(categorical_columns[i])
            # End of 'for' statement
        # Refresh the list of categorical columns: the columns encoded or dropped above must not be processed again
        categorical_columns = [col for col in p.columns if p[col].dtype == 'object']
        if len(cols_processed) != 0:
            print('kaggle_features_engineering: Encoders applied on %s' % str(cols_processed))
            print('kaggle_features_engineering: New dataset structure:')
            print(p.describe().T)
            print(p.head())
            print('kaggle_features_engineering: Cleaning categorical values: old_shape: %s / new shape: %s' % (str(old_shape), str(p.shape)))
            print('kaggle_features_engineering: new Categorical columns:')
            print(categorical_columns)
            # Compute new cardinalities of the categorical variables
            categorical_columns_cardinalities = list(map(lambda col: p[col].nunique(), categorical_columns))
            print('kaggle_features_engineering: New categorical_columns_cardinalities: ')
            print(categorical_columns_cardinalities)
        # TODO: Drop categorical variables with extreme cardinalities
        # Encode the remaining categorical variables using numerical mapping
        for col in categorical_columns:
            p[col] = Encoder_Instance.fit_transform(p[col].astype(str))
            # End of 'for' statement
        print('kaggle_features_engineering: Labelling:')
        print('kaggle_features_engineering: after second round:')
        print(p.head())
    else:
        print('kaggle_features_engineering: No categorical values')
    # Be sure there is no more 'object' types in the dataset
    categorical_columns = [col for col in p.columns if p[col].dtype == 'object']
    if len(categorical_columns) != 0:
        raise Exception('kaggle_features_engineering: There still has object type in dataset', 'method=%s' % str(categorical_columns))
    print('----------------------------- kaggle_features_engineering: After Second round:')
    print(p.head())
    print(p.describe().T)

    # TODO: Removing duplicated records
    # Build Correlation matrix
    print('----------------------------- kaggle_features_engineering: Correlation table:')
    # Extract correlations > 0.7 or < -0.7
    cor_matrix = p.corr(method = 'pearson') # Use the cleaned dataset: the raw one may still contain non-numerical columns
    print('----------------------------- kaggle_features_engineering: cor_matrix:')
    print(cor_matrix)
    upper_tri = cor_matrix.where(np.triu(np.ones(cor_matrix.shape), k = 1).astype(bool)) # np.bool was removed in NumPy 1.24
    print('----------------------------- kaggle_features_engineering: upper_tri:')
    print(upper_tri)
    p_correlation_threshold = 0.7
    print('----------------------------- kaggle_features_engineering: Correlations > %f or < -%f:' % (p_correlation_threshold, p_correlation_threshold))
    # Compare absolute values so that strong negative correlations are detected too
    to_drop = [column for column in upper_tri.columns if any(upper_tri[column].abs() > p_correlation_threshold)]
    print('kaggle_features_engineering: Drop ', to_drop)
    # Drop correlated columns
    p.drop(to_drop, axis = 1, inplace = True)
    print('----------------------------- kaggle_features_engineering: After Third round:')
    print(p.head())
    print(p.describe().T)
    
    # Identifying & handling outliers
    # TODO: Not handled here. The feature transformation (scaling) only rescales the values, it does not remove the outliers

 #   raise Exception('Stop', 'Stop')

    print('kaggle_features_engineering: ', list(new_categorical_columns)) 
    print('kaggle_features_engineering: Done') 
    return p, new_categorical_columns
    # End of function kaggle_features_engineering

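There are different kinds of data transformation:
- Standardization: it removes the mean of the feature and scales it to unit variance (see point 2.a)
- Scaling: it rescales the feature values to the range [0, 1]
- Absolute scaling: it divides the feature values by their maximum absolute value, which gives the range [-1, 1]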
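
To make the first two transformations concrete, here is a minimal sketch in plain Python. It is not the notebook's implementation (which relies on the scikit-learn scalers): the helper names `standardize` and `min_max_scale` and the sample values are assumptions made for illustration, and the feature is assumed not to be constant (otherwise the divisions would fail).

```python
from statistics import mean, pstdev

def standardize(values):
    """Center the values on their mean and divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max_scale(values):
    """Map the smallest value to 0 and the largest value to 1."""
    lowest, highest = min(values), max(values)
    return [(v - lowest) / (highest - lowest) for v in values]

# Hypothetical feature values, chosen only for illustration
feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
standardized = standardize(feature)
scaled = min_max_scale(feature)
print(standardized)
print(scaled)
```

After standardization the feature has mean 0 and unit variance; after scaling, every value lies between 0 and 1.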

In [None]:
# b) Data Transforms
def kaggle_data_transform(p_df: pd.core.frame.DataFrame, p_columns:list = None, p_transform: str = 'standard') -> pd.core.frame.DataFrame:
    """
    Apply data transformation to the provided dataset
    :parameter p_df: The dataset handle
    :parameter p_columns: The columns to transform (e.g. we don't apply transformation on categorical columns). Default: None, means that the whole dataset is transformed
    :parameter p_transform: The type of transformation. Default: 'standard'
                            'standard': Remove the mean and scale to unit variance
                            'scale': Scale each feature to the range [0, 1]
                            'abs_scale': Scale each feature to the range [-1, 1]
    :return: The dataset after transformation
    """
    print('----------------------------- kaggle_data_transform -----------------------------')
    transform = None
    if p_transform == 'standard':
        # Standardization, or mean removal and variance scaling
        transform = StandardScaler()
    elif p_transform == 'scale':
        # Scaling features to a range
        transform = MinMaxScaler()
    elif p_transform == 'abs_scale':
        # Scaling features to a range
        transform = MaxAbsScaler()
    else:
        raise Exception('kaggle_data_transform: Wrong parameters', 'p_transform=%s' % p_transform)
    p = None
    if p_columns is None: # Apply transformation to the whole dataset
        p = transform.fit_transform(p_df)
        p = pd.DataFrame(data = p, columns = p_df.columns, index = p_df.index)
    else:
        p = p_df.copy()
        for column in p_columns:
            # fit_transform() returns a NumPy array: assign it directly so that the values are not realigned on the index
            p[column] = transform.fit_transform(p[[column]]).ravel()
    print('kaggle_data_transform: Dataset Head:')
    print(p.head())
    
    print('kaggle_data_transform: Done')
    return p
    # End of function kaggle_data_transform

The kaggle_feature_selection() function reorganizes the whole dataset to put the features first and the output (target) last. 
If the parameter p_attributes is None, all the features except the output are considered as inputs of the ML.
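
The reordering rule can be sketched on a plain list of column names. This is an illustration only: the helper `reorder_columns` and the column names are hypothetical, and with pandas the resulting list would simply be used to index the DataFrame.

```python
def reorder_columns(columns, target, attributes=None):
    """Return the new column order: the selected features first, the target last."""
    if attributes is None:
        # No explicit selection: keep every column except the target as a feature
        attributes = [c for c in columns if c != target]
    return list(attributes) + [target]

# Hypothetical column names, chosen only for illustration
columns = ['sepal_length', 'species', 'petal_length', 'petal_width']
all_features_order = reorder_columns(columns, 'species')
selected_order = reorder_columns(columns, 'species', ['petal_length'])
print(all_features_order)
print(selected_order)
```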

In [None]:
# b) Feature Selection
def kaggle_feature_selection(p_df: pd.core.frame.DataFrame, p_target: str, p_attributes: list = None) -> pd.core.frame.DataFrame:
    """
    Reorganize the dataset to keep only the provided attributes, the target column is the last column of the new dataset
    :parameter p_df: The dataset handle
    :parameter p_target: The output of the Machine Learning
    :parameter p_attributes: The inputs of the Machine Learning. Default: None, means that all the features are used to train and validate the model
    :return: The dataset after features selection
    """
    print('----------------------------- kaggle_feature_selection -----------------------------')
    p = p_df.copy()
    y_values = p[[p_target]]
    print('----------------------------- kaggle_feature_selection: y_values:')
    print(y_values.head())
    p.drop(p_target, inplace = True, axis = 1)
    x_values = None
    print('kaggle_feature_selection: p_attributes: ', p_attributes)
    if p_attributes is None:
        x_values = p
    else:
        x_values = p[p_attributes]
    print('----------------------------- kaggle_feature_selection: x_values:')
    print(x_values.head())
    p = pd.concat([x_values, y_values], axis=1)
    print('----------------------------- kaggle_feature_selection: new dataset:')
    print(p.head())

    print('kaggle_feature_selection: Done')
    return p
    # End of function kaggle_feature_selection

After cleaning and transforming the initial dataset, we can use it to train and validate our ML. So, the next step is to split our dataset into three different 'sub-datasets' (point 4.a):
1. The training dataset, used to train and evaluate the ML models
2. The validation dataset, used to validate the selected model
3. The test dataset, used to test the model
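
The three-way split can be sketched with the standard library only. This is a simplified illustration, not the notebook's function: the helper name `three_way_split` and the seed value are assumptions, while the proportions (10% for the test dataset, then 20% of the remainder for validation) follow the defaults used in this notebook.

```python
import random

def three_way_split(rows, test_frac=0.1, validation_frac=0.2, seed=42):
    """Shuffle the rows, then cut them into training, validation and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # First, put aside the test samples
    n_test = round(len(shuffled) * test_frac)
    test, rest = shuffled[:n_test], shuffled[n_test:]
    # Then, split the remaining samples between validation and training
    n_validation = round(len(rest) * validation_frac)
    validation, training = rest[:n_validation], rest[n_validation:]
    return training, validation, test

training, validation, test = three_way_split(range(100))
print(len(training), len(validation), len(test))
```

Each sample ends up in exactly one of the three parts, so the test dataset is never seen during training or validation.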

In [None]:
# 4. Evaluate Algorithms
# a) Split-out validation dataset
def kaggle_split_dataset(p_df: pd.core.frame.DataFrame, p_validation_size: float = 0.20, p_random_state: int = SEED_HARCODED_VALUE) -> list:
    """
    The dataset is split into three different 'sub' datasets (4.a): 1) the training dataset, 2) the validation dataset, 3) the test dataset
    10% of the dataset is used for testing the model
    :parameter p_df: The dataset, with the target in the last column
    :parameter p_validation_size: The fraction of the remaining 90% of the dataset used for validation. Default: 0.2, i.e. 20% of the remaining 90% is used for validation and 80% for training
    :parameter p_random_state: The seed used to sample and shuffle the dataset
    :return: The list (x_training, x_validation, y_training, y_validation, x_test, y_test, [input column names, output column names])
    """
    print('----------------------------- kaggle_split_dataset -----------------------------')
    # First, extract test samples (10% of the full dataset)
    p = p_df.copy()
    df_values_columns = p.columns
    s = p.sample(frac = 0.1, random_state = p_random_state)
    df_values = s.values
    test_inputs = df_values[:,0:len(p.columns) - 1] # X contains all the columns but the last one: these are the inputs of the ML
    test_inputs_columns = df_values_columns[:len(p.columns) - 1]
    test_outputs = df_values[:,len(p.columns) - 1]  # Y contains the last column (the target): this is the output of the ML
    test_outputs_columns = [df_values_columns[-1]]
    p = p.drop(s.index) # Remove the test samples from the dataset (assumes unique index labels)
    print('----------------------------- kaggle_split_dataset: test_inputs/test_outputs')
    print(test_inputs_columns)
    print(test_outputs_columns)
    print(test_inputs[:5])
    print(test_outputs[:5])
    # Extract training and validation dataset
    df_values = p.values
    ml_inputs = df_values[:,0:len(p.columns) - 1] # X contains all the columns but the last one: these are the inputs of the ML
    ml_outputs = df_values[:,len(p.columns) - 1]  # Y contains the last column (the target): this is the output of the ML
    print('----------------------------- kaggle_split_dataset: ml_inputs/ml_outputs')
    print(ml_inputs[:5])
    print(ml_outputs[:5])
    result = model_selection.train_test_split(ml_inputs, ml_outputs, train_size = 1 - p_validation_size, random_state = p_random_state)
    result.append(test_inputs)
    result.append(test_outputs)
    result.append([test_inputs_columns, test_outputs_columns])

    print('kaggle_split_dataset: Done')
    return result
    # End of function kaggle_split_dataset

Now we can apply different models (linear, non-linear, ensemble...) to train our ML and evaluate their performance using cross-validation (4.b).

TODO Enhance model scoring method for both Regressor & Classifier
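
To show what the cross-validation step does, here is a minimal sketch of k-fold scoring in plain Python. It is a simplification, not the scikit-learn implementation: the helpers `k_fold_indices`, `cross_val_scores`, `fit_majority` and `accuracy` are hypothetical, the folds are built by striding over shuffled indices, and the "model" is a toy that always predicts the most frequent training label.

```python
import random
from statistics import mean, pstdev

def k_fold_indices(n_samples, n_splits, seed=42):
    """Yield (train_idx, test_idx) couples; each sample is in a test fold exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, test_idx

def cross_val_scores(fit, score, X, y, n_splits=5):
    """Train on k-1 folds, score on the remaining fold, and repeat for every fold."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), n_splits):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(score(model, [X[i] for i in test_idx], [y[i] for i in test_idx]))
    return scores

def fit_majority(X, y):
    """Toy 'model': remember the most frequent label of the training data."""
    return max(set(y), key=y.count)

def accuracy(model, X, y):
    """Fraction of samples whose label equals the remembered majority label."""
    return sum(1 for label in y if label == model) / len(y)

# Hypothetical dataset: 20 samples, 15 of class 0 and 5 of class 1
X = [[i] for i in range(20)]
y = [0] * 15 + [1] * 5
scores = cross_val_scores(fit_majority, accuracy, X, y, n_splits=5)
print('accuracy: %f (%f)' % (mean(scores), pstdev(scores)))
```

The mean and standard deviation of the per-fold scores are the two numbers reported for each model in the notebook.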


In [None]:
# b) Check models
def kaggle_check_models(p_models: list, p_inputs_training_df: pd.core.frame.DataFrame, p_outputs_training_df: pd.core.frame.DataFrame, p_kparts: int = 10, p_random_state: int = SEED_HARCODED_VALUE, p_cross_validation: str = 'k-fold', p_scoring: str = 'accuracy') -> tuple:
    """
    Apply different models to train our Machine Learning and evaluate their efficiency using cross-validation
    :parameter p_models: A list of couples (model name, model) to evaluate
    :parameter p_inputs_training_df: The training inputs dataset (training attributes)
    :parameter p_outputs_training_df: The training output dataset (training target)
    :parameter p_kparts: The number of folds used for the cross-validation. Default: 10
    :parameter p_random_state: The seed used to shuffle the data before splitting it into folds
    :parameter p_cross_validation: The cross-validation method: 'k-fold' or 'k-stratifie-kfold'. Default: 'k-fold'
    :parameter p_scoring: The name of the scikit-learn scoring metric. Default: 'accuracy'
    :return: The couple (results, names): the cross-validation scores of each model and the model names
    """
    print('----------------------------- kaggle_check_models -----------------------------')
    results = []
    names = []
    for name, model in p_models:
        print('kaggle_check_models: Processing %s with type %s' % (name, type(model)))
        # Create train/test indices to split data in train/test sets
        if p_cross_validation == 'k-fold':
            kfold = model_selection.KFold(n_splits = p_kparts, random_state = p_random_state, shuffle = True) # K-fold Cross Validation
        elif p_cross_validation == 'k-stratifie-kfold':
            kfold = model_selection.StratifiedKFold(n_splits = p_kparts, random_state = p_random_state, shuffle = True) # Stratified K-fold Cross Validation
        else:
            raise Exception('kaggle_check_models: Wrong parameters', 'p_cross_validation:%s' % p_cross_validation)
        cv_results = None
        # Evaluate model performance
        if p_cross_validation == 'k-fold' or p_cross_validation == 'k-stratifie-kfold':
            cv_results = model_selection.cross_val_score(model, p_inputs_training_df, p_outputs_training_df, cv = kfold, scoring = p_scoring)
        else:
            cv_results = model_selection.cross_val_score(model, p_inputs_training_df, p_outputs_training_df, cv = LeaveOneOut(), scoring = p_scoring)
        print('kaggle_check_models: cv_result=%s' % str(cv_results))
        results.append(cv_results)
        names.append(name)
        msg = 'kaggle_check_models: %s metric: %s: %f (%f)' % (p_scoring, name, cv_results.mean(), cv_results.std())
        print(msg)
        print()
        # End of 'for' loop

    print('kaggle_check_models: Done')
    return results, names
    # End of function kaggle_check_models

Then, we select the best model, i.e. the one with the highest mean cross-validation score (4.c).
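
The selection rule can be sketched as follows. The scores below are invented for illustration and are not results produced by this notebook; the rule itself (keep the model with the highest mean score) relies on the scikit-learn convention that a higher score is better, which holds for 'accuracy', 'r2' and the 'neg_...' error metrics.

```python
from statistics import mean, pstdev

# Hypothetical cross-validation scores, invented for illustration
cv_scores = {
    'LR': [0.90, 0.92, 0.94],
    'KNN': [0.95, 0.97, 0.96],
    'CART': [0.91, 0.99, 0.89],
}

# Summarize each model by the mean and standard deviation of its scores
summary = {name: (mean(scores), pstdev(scores)) for name, scores in cv_scores.items()}

# Keep the model with the highest mean score
best = max(summary, key=lambda name: summary[name][0])
print(best, summary[best])
```

The standard deviation is not used for the selection here, but it is worth checking: a model with a slightly lower mean and a much smaller spread can be the safer choice.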

In [None]:
def kaggle_compare_algorithms_perf(p_names: list, p_metrics: list, p_title: str, p_x_label: str, p_y_label:str) -> int:
    print('----------------------------- kaggle_compare_algorithms_perf -----------------------------')
    # Extract means & std
    means = []
    stds = []
    for i in range (len(p_names)):
        cv_results = p_metrics[i]
        means.append(cv_results.mean())
        stds.append(cv_results.std())
        # End of 'for' statement
    # Display means/standard deviation
    plt.title(p_title)
    plt.xlabel(p_x_label)
    plt.ylabel(p_y_label)
    plt.errorbar(p_names, means, stds, linestyle='None', marker='^')
    #plt.savefig('kaggle_algorithms_comparison.png')
    plt.show()
    # Select the best algorithm
    m = np.array(means)
    maxv = np.amax(m)
    idx = np.where(m == maxv)[0][0]
    print('kaggle_compare_algorithms_perf: Max value: %d:%f +/- %f ==> %s' % (idx, maxv, 2 * stds[idx], p_names[idx]))

    print('kaggle_compare_algorithms_perf: Done')
    return idx
    # End of function kaggle_compare_algorithms_perf

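Note: the algorithm tuning part is still a work in progress. For now, the selected model is simply fitted on the training dataset.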

In [None]:
# 5. Improve Accuracy
# a) Algorithm Tuning
def kaggle_algorithm_tuning(p_algorithm: tuple, p_inputs_training_df: pd.core.frame.DataFrame, p_outputs_training_df: pd.core.frame.DataFrame, p_validation_data: list = None):
    print('----------------------------- kaggle_algorithm_tuning -----------------------------')
    model = p_algorithm[1] # p_algorithm is a couple (model name, model)
    print('----------------------------- kaggle_algorithm_tuning: %s' % model.__class__.__name__)
    if not model.__class__.__name__.startswith('Keras'):
        model.fit(p_inputs_training_df, p_outputs_training_df)
    else:
        early_stopping = keras.callbacks.EarlyStopping(patience = 5, min_delta = 0.001, restore_best_weights = True)
        # The batch size was already set when the Keras wrapper was created in kaggle_main() ('strategy' is local to kaggle_main())
        history = model.fit(p_inputs_training_df, p_outputs_training_df, validation_data = p_validation_data, epochs = DL_EPOCH_NUM, callbacks = [early_stopping])
        print('----------------------------- kaggle_algorithm_tuning: loss/val_loss plot')
        history_df = pd.DataFrame(history.history)
        history_df.loc[:, ['loss', 'val_loss']].plot(title="loss/val_loss")
        print('kaggle_algorithm_tuning: Minimum Validation Loss: {:0.4f}'.format(history_df['val_loss'].min()))

    print('----------------------------- kaggle_algorithm_tuning: model summary:')
    print(model)
    print('kaggle_algorithm_tuning: Done')
    return model
    # End of function kaggle_algorithm_tuning
# b) Ensembles
# 6. Finalize Model
# a) Predictions on validation dataset
def kaggle_validation_prediction(p_model, p_inputs, p_expected_outputs) -> np.ndarray:
    """
    Executes prediction (p_inputs) and compares outputs against expected outputs (Validation) using the specified ML model
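    :parameter p_model: The trained ML model
    :parameter p_inputs: The inputs (attributes) used for the prediction
    :parameter p_expected_outputs: The expected outputs (target), compared against the predictions
    :return: The predictions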
    """
    print('----------------------------- kaggle_validation_prediction -----------------------------')
    print('kaggle_validation_prediction: model=%s - is_class:%s - is_regr:%s' % (p_model.__class__.__name__, str(is_classifier(p_model)), str(is_regressor(p_model))))
    y_predictions = p_model.predict(p_inputs)
    if is_regressor(p_model) or p_model.__class__.__name__ == 'KerasRegressor': # Regression metrics (continuous target values)
        print('kaggle_validation_prediction: Model R2 score=%f' % (r2_score(p_expected_outputs, y_predictions)))
        print('kaggle_validation_prediction: Model Mean absolute error regression loss (MAE): %0.4f' % mean_absolute_error(p_expected_outputs, y_predictions))
        print('kaggle_validation_prediction: Model Mean squared error regression loss (MSE): %0.4f' % mean_squared_error(p_expected_outputs, y_predictions))
        print('kaggle_validation_prediction: Model Root mean squared error regression loss (RMSE): %0.4f' % np.sqrt(mean_squared_error(p_expected_outputs, y_predictions)))
        # Analyze residual errors
        plt.scatter(p_expected_outputs, y_predictions)
        plt.show()
        # TODO Interpreting the Coefficients if possible
    elif is_classifier(p_model) or p_model.__class__.__name__ == 'KerasClassifier': # Classification metrics (class target values)
        print('kaggle_validation_prediction: accuracy=%s' %(accuracy_score(p_expected_outputs, y_predictions)))
        # ROC AUC computed from predicted labels only works for binary targets (not for a multiclass problem such as Iris)
        if len(np.unique(p_expected_outputs)) == 2:
            print('kaggle_validation_prediction: ROC=%s' %(roc_auc_score(p_expected_outputs, y_predictions)))
        print('kaggle_validation_prediction: Confusion_matrix:%s' % str(confusion_matrix(p_expected_outputs, y_predictions)))
        print('kaggle_validation_prediction: Classification report:\n%s' % str(classification_report(p_expected_outputs, y_predictions)))
    else:
        raise Exception('kaggle_validation_prediction: Invalid model')
    print('kaggle_validation_prediction: prediction is %s' % (str(y_predictions)))

    print('kaggle_validation_prediction: Done')
    return y_predictions
    # End of function kaggle_validation_prediction

def kaggle_prediction(p_model, p_inputs) -> np.ndarray:
    """
    Executes prediction (p_inputs) using the specified ML model
    :parameter p_model: The trained ML model
    :parameter p_inputs: The inputs (attributes) of one sample
    :return: The prediction
    """
    print('----------------------------- kaggle_prediction -----------------------------')
    inputs = []
    inputs.append(p_inputs)
    y_prediction = p_model.predict(inputs)
    print('kaggle_prediction: prediction is %s' %(str(y_prediction)))
    print('kaggle_prediction: Done')
    return y_prediction
# b) Create standalone model on entire training dataset
# TODO

The functions below are helpers to save the model and to save our Machine Learning outcomes in Kaggle competition format.
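
Saving and restoring a model with pickle can be sketched as a round trip. A plain dictionary stands in for the trained model here (an assumption made to keep the sketch self-contained), and the file is written to a temporary directory instead of the Kaggle working directory.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model: any picklable Python object is handled the same way
model = {'name': 'toy-model', 'weights': [0.1, 0.2, 0.3]}

# Serialize the model; the 'with' statement closes the file even if an error occurs
path = os.path.join(tempfile.mkdtemp(), 'MyModel.pkl')
with open(path, 'wb') as f:
    pickle.dump(model, f)

# Restore the model later
with open(path, 'rb') as f:
    restored = pickle.load(f)
print(restored == model)
```

A pickle file should only be loaded when it comes from a trusted source, because unpickling can execute arbitrary code.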

In [None]:
# c) Save model for later use
def kaggle_save_model(p_model, p_file_name: str) -> None:
    """
    Save the provided model in pickle format
    :parameter p_model: The ML model to save
    :parameter p_file_name: The file name without file extension (e.g. './MyModel')
    """
    print('----------------------------- kaggle_save_model -----------------------------')
    # Serialize the model
    with open(p_file_name + '.pkl', 'wb') as f:
        pickle.dump(p_model, f)
    print('kaggle_save_model: Done: %s' % (p_file_name + '.pkl'))
    # End of function kaggle_save_model

The function kaggle_explore_ml() provides some insights into our ML: it computes the permutation importance of each feature on the validation dataset.
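
The idea behind permutation importance can be sketched in plain Python: shuffle one feature at a time and measure how much the score drops. This is a simplified illustration, not the scikit-learn implementation: the helper `permutation_importance_sketch`, the toy model and the random data are all assumptions.

```python
import random

def permutation_importance_sketch(predict, X, y, n_repeats=10, seed=42):
    """For each feature, return the mean accuracy drop observed after shuffling it."""
    rng = random.Random(seed)

    def accuracy(rows):
        return sum(predict(row) == target for row, target in zip(rows, y)) / len(y)

    baseline = accuracy(X)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle the values of one feature and leave the others untouched
            column = [row[col] for row in X]
            rng.shuffle(column)
            permuted = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(permuted))
        importances.append(sum(drops) / n_repeats)
    return importances

def predict_from_first_feature(row):
    """Toy 'model' that only looks at the first feature."""
    return int(row[0] > 0.5)

# Hypothetical data: two random features, the target depends on the first one only
data_rng = random.Random(0)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [predict_from_first_feature(row) for row in X]
importances = permutation_importance_sketch(predict_from_first_feature, X, y)
print(importances)
```

Shuffling the first feature degrades the predictions, so its importance is clearly positive, while the second feature is ignored by the model and gets an importance of zero.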

In [None]:
def kaggle_explore_ml(p_model, p_x_validation: pd.core.frame.DataFrame, p_y_validation: pd.core.frame.DataFrame, p_random_state:int = SEED_HARCODED_VALUE) -> None:
    """
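    Apply feature importance concept to our ML 
    :parameter p_model: The trained ML model to explore
    :parameter p_x_validation: The validation inputs dataset
    :parameter p_y_validation: The validation output dataset
    :parameter p_random_state: The seed used to permute the features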
    """
    print('----------------------------- kaggle_explore_ml -----------------------------')
    result = permutation_importance(p_model, p_x_validation, p_y_validation, n_repeats = 32, random_state = p_random_state)
    sorted_idx = result.importances_mean.argsort()
    print('----------------------------- kaggle_explore_ml: result:')
    print(sorted_idx)

    fig, ax = plt.subplots()
    ax.boxplot(result.importances[sorted_idx].T, vert = False, labels = p_x_validation.columns[sorted_idx])
    ax.set_title("Permutation Importances (Validation set)")
    fig.tight_layout()
    plt.show()
    print('kaggle_explore_ml: Done')
    # End of function kaggle_explore_ml

The code below is specific to deep learning. It provides the callbacks used to create the DL (neural network) models.

In [None]:
# Start of main application
DL_INPUT_SHAPE = None

def kaggle_create_sequential_classifier_model(p_optimizer:str = 'adam', p_loss:str = 'binary_crossentropy', p_metrics:list = ['accuracy']) -> tf.keras.Sequential:
    """
    Build a Neural network model for classification
    """
    print('----------------------------- kaggle_create_sequential_classifier_model -----------------------------')
    model = tf.keras.Sequential([
            tf.keras.layers.BatchNormalization(input_shape = DL_INPUT_SHAPE),
            tf.keras.layers.Dense(64, activation='relu'),
            tf.keras.layers.Dropout(rate = DL_DROP_RATE),
            tf.keras.layers.BatchNormalization(),
            tf.keras.layers.Dense(64, activation='relu'),
            tf.keras.layers.BatchNormalization(),
            tf.keras.layers.Dropout(rate = DL_DROP_RATE),
            tf.keras.layers.Dense(64, activation='relu'),
            tf.keras.layers.BatchNormalization(),
            tf.keras.layers.Dropout(rate = DL_DROP_RATE),
            tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer=p_optimizer, loss = p_loss, metrics = p_metrics)
    return model
    # End of function kaggle_create_sequential_classifier_model

def kaggle_create_sequential_regressor_model(p_optimizer:str = 'adam', p_loss:str = 'mae', p_metrics:list = ['mae']) -> tf.keras.Sequential:
    """
    Build a Neural network model for regression
    """
    print('----------------------------- kaggle_create_sequential_regressor_model -----------------------------')
    model = tf.keras.Sequential([
            tf.keras.layers.BatchNormalization(input_shape = DL_INPUT_SHAPE),
            tf.keras.layers.Dense(64, activation='relu'),
            tf.keras.layers.Dropout(rate = DL_DROP_RATE),
            tf.keras.layers.BatchNormalization(),
            tf.keras.layers.Dense(64, activation='relu'),
            tf.keras.layers.BatchNormalization(),
            tf.keras.layers.Dropout(rate = DL_DROP_RATE),
            tf.keras.layers.Dense(64, activation='relu'),
            tf.keras.layers.BatchNormalization(),
            tf.keras.layers.Dropout(rate = DL_DROP_RATE),
            tf.keras.layers.Dense(1, activation='relu'),
    ])
    model.compile(optimizer=p_optimizer, loss = p_loss, metrics = p_metrics)
    return model
    # End of function kaggle_create_sequential_regressor_model

Finally, here is the entry point function and the main call:

In [None]:
def kaggle_main() -> None:
    global DL_INPUT_SHAPE
    
    # Set defaults
    set_seed()
    set_mp_default()
    
    # Current path
    print(os.path.abspath(os.getcwd()))
    # Kaggle current path and files
    for dirname, _, filenames in os.walk('/kaggle/input'):
        for filename in filenames:
            print(os.path.join(dirname, filename))

    # Modules version
    modules_versions()

    # Parse arguments. Used only if this notebook code is used as a standalone Python script
    #flags = ExecutionFlags.NONE
    flags = ExecutionFlags.ALL & ~ExecutionFlags.USE_NEURAL_NETWORK_FLAG & ~ExecutionFlags.DATA_STAT_SUMMURIZE_FLAG & ~ExecutionFlags.DATA_VISUALIZATION_FLAG
    #parser = argparse.ArgumentParser()
    #parser.add_argument('--summarize', help = 'Process statistical analyze', action='store_true')
    #parser.add_argument('--summarize-only', help = 'Process only statistical analyze', action='store_true')
    #parser.add_argument('--visualize', help = 'Generate different plots based on statistical analyze', action='store_true')
    #parser.add_argument('--no-data-cleaning', help = 'Do not apply Data Cleaning', action='store_true')
    #parser.add_argument('--neural-network', help = 'Use neural network as ML', action='store_true')
    #args = parser.parse_args()
    #if args.summarize or args.summarize_only:
    #    flags |= ExecutionFlags.DATA_STAT_SUMMURIZE_FLAG
    #if args.visualize:
    #    flags |= ExecutionFlags.DATA_VISUALIZATION_FLAG
    #if args.no_data_cleaning:
    #    flags &= ~ExecutionFlags.DATA_CLEANING_FLAG
    #if args.neural_network:
    #    flags |= ExecutionFlags.USE_NEURAL_NETWORK_FLAG
    
    # TODO Uncomment if using Pima Indians diabetes dataset
    #flags &= ~ExecutionFlags.DATA_CLEANING_FLAG
    print("generic template approach to 'play' with the Machine Learning concepts: flags=%s" % str(flags))

    strategy = None
    if flags & ExecutionFlags.USE_NEURAL_NETWORK_FLAG == ExecutionFlags.USE_NEURAL_NETWORK_FLAG:
        strategy = kaggle_tpu_detection()

    df = kaggle_load_dataset(p_url = DATABASE_NAME, p_labels = COLUMNS_LABEL)

    # Do a basic ML evaluation as reference for the end
    y_basic_predictions = kaggle_ml_quick_and_dirty(df)

    numerical_columns = None
    categorical_columns = None
    if flags & ExecutionFlags.DATA_STAT_SUMMURIZE_FLAG == ExecutionFlags.DATA_STAT_SUMMURIZE_FLAG:
        kaggle_summurize_data(df)
    #    if args.summarize_only:
    #        return

    if flags & ExecutionFlags.DATA_VISUALIZATION_FLAG == ExecutionFlags.DATA_VISUALIZATION_FLAG:
        kaggle_visualization(df)
        
    if flags & ExecutionFlags.DATA_CLEANING_FLAG == ExecutionFlags.DATA_CLEANING_FLAG:
        df, categorical_column = kaggle_features_engineering(df, p_date_time_columns = DATE_TIME_COLUMNS, p_missing_value_method = 'mean')
        # Extract non  categorical columns based on categorical_column list
        columns_to_transform = list(set(df.columns) - set(categorical_column))
        if NON_TRANSFORMABLE_COLUMNS is not None:
            columns_to_transform = list(set(columns_to_transform) - set(NON_TRANSFORMABLE_COLUMNS))
        print('Main: columns_to_transform = ', columns_to_transform)
        df = kaggle_data_transform(df, columns_to_transform, p_transform = 'scale')

    df = kaggle_feature_selection(df, p_target = TARGET_COLUMN, p_attributes = FEATURES_SELECTION)

    ml_inputs_training_df, ml_inputs_validation_df, ml_outputs_training_df, ml_outputs_validation_df, ml_inputs_test_df, ml_outputs_test_df, columns_list = kaggle_split_dataset(df)
    print('Main: training dataset shape: ', ml_inputs_training_df.shape)
    print('Main: validation dataset shape: ', ml_inputs_validation_df.shape)
    print('Main: test dataset shape: ', ml_inputs_test_df.shape)
    print('Main: columns: ', columns_list)
    
    # Stacking models
    models = []
    DL_INPUT_SHAPE = [ml_inputs_training_df.shape[1]]
    scoring = None
    if OUTPUT_IS_REGRESSION: # Use regression algorithms
        scoring = 'r2' # 'r2' or 'neg_mean_absolute_error'
        models.append(('LR', linear_model.LinearRegression()))
        models.append(('LASSO', linear_model.Lasso()))
        models.append(('EN', linear_model.ElasticNet()))
        models.append(('KNN', neighbors.KNeighborsRegressor()))
        models.append(('CART', tree.DecisionTreeRegressor(max_leaf_nodes = 256, random_state = SEED_HARCODED_VALUE)))
        models.append(('XGB', xgb.XGBRegressor(random_state = SEED_HARCODED_VALUE)))
        models.append(('RF', ensemble.RandomForestRegressor(n_estimators = 128, random_state = SEED_HARCODED_VALUE)))
        #models.append(('SVR', svm.SVR()))
        if flags & ExecutionFlags.USE_NEURAL_NETWORK_FLAG == ExecutionFlags.USE_NEURAL_NETWORK_FLAG: # DL model
            model = KerasRegressor(build_fn = kaggle_create_sequential_regressor_model, epochs = DL_EPOCH_NUM, batch_size = DL_BATCH_SIZE * strategy.num_replicas_in_sync)
            models.append(('NRR', model))
    else: # Use classifier algorithms
        scoring = 'accuracy'
        models.append(('LR', linear_model.LogisticRegression()))
        models.append(('LDA', discriminant_analysis.LinearDiscriminantAnalysis()))
        models.append(('KNN', neighbors.KNeighborsClassifier()))
        models.append(('CART', tree.DecisionTreeClassifier(max_leaf_nodes = 256, random_state = SEED_HARCODED_VALUE)))
        models.append(('XGB', xgb.XGBClassifier(random_state = SEED_HARCODED_VALUE)))
        models.append(('RF', ensemble.RandomForestClassifier(n_estimators = 128, random_state = SEED_HARCODED_VALUE)))
        models.append(('NB', naive_bayes.GaussianNB())) # GaussianNB has no random_state parameter
        models.append(('SVM', svm.SVC(random_state = SEED_HARCODED_VALUE)))
        if flags & ExecutionFlags.USE_NEURAL_NETWORK_FLAG == ExecutionFlags.USE_NEURAL_NETWORK_FLAG: # DL model
            model = KerasClassifier(build_fn = kaggle_create_sequential_classifier_model, epochs = DL_EPOCH_NUM, batch_size = DL_BATCH_SIZE * strategy.num_replicas_in_sync)
            models.append(('NNC', model))

    results, names = kaggle_check_models(models, ml_inputs_training_df, ml_outputs_training_df, p_scoring = scoring)
    best_alg = kaggle_compare_algorithms_perf(names, results, 'Algorithms Comparison', 'Algorithms', scoring) # Label with the metric actually used ('r2' or 'accuracy')
    ml = kaggle_algorithm_tuning(models[best_alg], ml_inputs_training_df, ml_outputs_training_df, (ml_inputs_validation_df, ml_outputs_validation_df))
    y_validation_predictions = kaggle_validation_prediction(ml, ml_inputs_validation_df, ml_outputs_validation_df) # Check on the validation dataset
    y_test_predictions = kaggle_validation_prediction(ml, ml_inputs_test_df, ml_outputs_test_df) # Final check on the held-out test dataset
    
    kaggle_save_model(ml, '/kaggle/working/' + ml.__class__.__name__)
    
    # Exploring the ML
    kaggle_explore_ml(ml, pd.DataFrame(ml_inputs_validation_df, columns = columns_list[0]), pd.DataFrame(ml_outputs_validation_df, columns = columns_list[1]))
    
    # End of function kaggle_main
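
The comparison step in kaggle_main relies on helpers defined earlier in this notebook (kaggle_check_models and kaggle_compare_algorithms_perf). As a rough standalone illustration of the same idea, here is a minimal sketch that cross-validates a list of (name, model) tuples and picks the best mean score. It is not the notebook's actual implementation: the function name compare_models, the seed value, the choice of 5 stratified folds and the use of the built-in Iris dataset are all assumptions made for the example.

```python
from sklearn import datasets, linear_model, model_selection, neighbors, tree

SEED = 42  # Hypothetical seed, stands in for SEED_HARCODED_VALUE

# Small built-in dataset so the sketch runs without any download
X, y = datasets.load_iris(return_X_y=True)

# Same (name, estimator) tuple layout as the models list in kaggle_main
models = [
    ('LR', linear_model.LogisticRegression(max_iter=1000)),
    ('KNN', neighbors.KNeighborsClassifier()),
    ('CART', tree.DecisionTreeClassifier(random_state=SEED)),
]

def compare_models(models, X, y, scoring='accuracy', n_splits=5):
    """Cross-validate each model and return {name: array of per-fold scores}."""
    kfold = model_selection.StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=SEED)
    results = {}
    for name, model in models:
        results[name] = model_selection.cross_val_score(model, X, y, cv=kfold, scoring=scoring)
    return results

results = compare_models(models, X, y)
for name, scores in results.items():
    print('%s: %.3f (+/- %.3f)' % (name, scores.mean(), scores.std()))

# Keep the model with the highest mean cross-validation score
best_name = max(results, key=lambda name: results[name].mean())
print('Best:', best_name)
```

For a regression problem, the same loop should work with regressors, scoring='r2' and a plain KFold instead of StratifiedKFold, since stratification only makes sense for class labels.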

Phew! Now we can execute all the steps described above and get some results:

In [None]:
# Entry point
kaggle_main()

If you liked this notebook, please upvote.
It gives me motivation to make new notebooks :)