# GPLearn - Genetic Programming for regression problems

> Symbolic regression is a machine learning technique that aims to identify an underlying mathematical expression that best describes a relationship. It begins by building a population of naive random formulas to represent a relationship between known independent variables and their dependent variable targets in order to predict new data. Each successive generation of programs is then evolved from the one that came before it by selecting the fittest individuals from the population to undergo genetic operations.
[Introduction to GP](https://gplearn.readthedocs.io/en/stable/intro.html#introduction-to-gp)

This Kernel is an adaptation of my previous work [LANL GP Regression](https://www.kaggle.com/elvenmonk/lanl-gp-regression) for the [LANL Earthquake Prediction](https://www.kaggle.com/c/LANL-Earthquake-Prediction) competition, applied to this playground data.

The initial kernel was built solely on my personal domain knowledge and didn't include proper feature analysis and engineering, which could help improve the final result.

I'm open to any suggestions and ideas on how I can improve the feature or function set, and on which analysis and visualization tools and approaches can give a better understanding of the features, their relations, etc.

The Genetic Programming approach is very sensitive to the selection of functions that will be used to build the result.
A good understanding of feature value distributions and their possible relations to the target can be crucial not only to the accuracy of the final result but also to the overall performance of the computations.

I think GPLearn is one of the great libraries for modeling Genetic Programming solutions.
You can read more about it in the official [GPLearn Docs](https://gplearn.readthedocs.io/en/stable/examples.html#example-1-symbolic-regressor).

In [None]:
import numpy as np
import pandas as pd
import time
start_time = time.time()

train_df = pd.read_csv('../input/tabular-playground-series-jan-2021/train.csv', index_col='id')
train_df = train_df[train_df['target'] != 0]
X_test = pd.read_csv('../input/tabular-playground-series-jan-2021/test.csv', index_col='id')
# Copy so that the GMM class columns added later are not written into a slice of train_df
X_train = train_df[train_df.columns[:-1]].copy()
Y_train = train_df[train_df.columns[-1:]]
print(X_test.shape)
print(X_train.shape)
print(Y_train.shape)

## Feature Distribution

The code below is borrowed from [Handling Multimodal Distributions & FE Techniques](https://www.kaggle.com/iamleonie/handling-multimodal-distributions-fe-techniques). Please check out this great Kernel if you have not done so yet.

The diagrams below compare the value distributions of each feature in the train (blue) and test (orange) datasets.

These visualizations are very useful for making sure that every feature has a similar distribution in the train and test datasets. Otherwise we could not rely on that feature to help predict the test target after training on data with a different distribution.

They also give us some clues later, when we choose the set of functions to be used for genetic programming.
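
The visual comparison can be complemented by a number. Below is a minimal sketch, assuming NumPy only, of the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs). `ks_statistic` is a hypothetical helper, not part of the original kernel, and the arrays are synthetic stand-ins for a train and a test column.

```python
import numpy as np

def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])  # the gap can only change at observed values
    cdf_a = np.searchsorted(a, grid, side='right') / a.size
    cdf_b = np.searchsorted(b, grid, side='right') / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(0)
train_like = rng.uniform(0, 1, 5000)     # synthetic stand-in for a train feature
test_like = rng.uniform(0, 1, 5000)      # drawn from the same distribution
shifted = rng.uniform(0.3, 1.3, 5000)    # deliberately shifted distribution

print(ks_statistic(train_like, test_like))  # small when distributions match
print(ks_statistic(train_like, shifted))    # noticeably larger
```

Applied to this notebook's data, the call would look something like `ks_statistic(X_train['cont1'], X_test['cont1'])`; values near 0 suggest closely matching distributions.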

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.distplot(Y_train)

f, ax = plt.subplots(nrows=2, ncols=7, figsize=(21, 6))

for r in range(2):
    for c in range(7):
        n = 7*r+c
        ax[r, c].set(ylim=(0, 4))
        sns.distplot(X_train[f'cont{n+1}'], ax=ax[r, c])
        sns.distplot(X_test[f'cont{n+1}'], ax=ax[r, c])

plt.tight_layout()
plt.show()

# Add GMM class features

As done in [Handling Multimodal Distributions & FE Techniques](https://www.kaggle.com/iamleonie/handling-multimodal-distributions-fe-techniques), we can use a Gaussian mixture to classify each feature's values and add the class information as separate features.

For Genetic Programming it might be better to have multiple boolean features (yes/no for each class) than to decide what value to assign to every class.
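
To illustrate why indicator columns suit Genetic Programming: an evolved formula can weight each class by multiplying its 0/1 column by a constant, without a single numeric class code imposing an order on the classes. A minimal sketch with made-up labels; `one_hot` is a hypothetical helper that mirrors what the cell below does column by column.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Return an (n_samples, n_classes) array of 0/1 class indicators."""
    labels = np.asarray(labels)
    return (labels[:, None] == np.arange(n_classes)).astype(int)

labels = np.array([0, 2, 1, 2])   # made-up class labels, e.g. from gmm.predict
indicators = one_hot(labels, 3)
print(indicators)

# A formula can assign any value to each class via a weighted sum of indicators
class_values = np.array([0.5, -1.0, 2.0])   # arbitrary illustrative constants
print(indicators @ class_values)
```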

In [None]:
from sklearn.mixture import GaussianMixture
from tqdm.auto import tqdm

def get_gmm_class_feature(name, n):
    gmm = GaussianMixture(n_components=n, random_state=42)
    gmm.fit(X_train[name].values.reshape(-1, 1))
    X_train_class = gmm.predict(X_train[name].values.reshape(-1, 1))
    X_test_class = gmm.predict(X_test[name].values.reshape(-1, 1))

    for i in range(n):
        X_train[f'{name}_{i}'] = 1 * (X_train_class == i)
        X_test[f'{name}_{i}'] = 1 * (X_test_class == i)

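# Per-feature component counts (used further below only to scale plot axes)
n_classes = [4, 11, 3, 3, 9, 5, 1, 3, 5, 2, 2, 2, 5, 6]
# Every feature is split into 12 GMM components here, regardless of n_classes
for n in tqdm(range(14), total=14):
    get_gmm_class_feature(f'cont{n+1}', 12)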

In [None]:
f, ax = plt.subplots(nrows=2, ncols=7, figsize=(21, 6))

for r in range(2):
    for c in range(7):
        n = 7*r+c
        ax[r, c].set(ylim=(0, 4+2*12))
        X = X_train[f'cont{n+1}']
        for i in range(12):
            sns.distplot(X[X_train[f'cont{n+1}_{i}'] == 1], ax=ax[r, c])
        #sns.distplot(X_test[f'cont{n}'], ax=ax[r, c])

plt.tight_layout()
plt.show()

## Optional feature normalization

Sometimes we need to normalize the source feature data (scale it to fit or center it within the `[0, 1]` interval) for some regression algorithms in order to get meaningful results.

For the GPLearn regressor this is not necessary, as it can decide to multiply and shift any feature by a specific constant during evolution.
Still, it can be very useful: it saves a lot of time otherwise spent optimizing parameters for each feature, and it reduces the chance that a good formula is rejected or not selected because of the [bloat-fighting techniques](https://gplearn.readthedocs.io/en/stable/intro.html#bloat) used.

This step doesn't seem to be necessary for this competition, because all of the data seems to be normalized already, but I left it in as guidance for anyone adapting this technique to other data.
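
For reference, min-max scaling maps each column to `(x - min) / (max - min)`, with `min` and `max` taken from the training data only. A minimal NumPy sketch with made-up numbers, mirroring what `MinMaxScaler` does with its default settings:

```python
import numpy as np

train = np.array([[2.0, 10.0], [4.0, 30.0], [6.0, 20.0]])   # made-up values
test = np.array([[3.0, 40.0]])

col_min = train.min(axis=0)
col_range = train.max(axis=0) - col_min      # assumes no constant columns

train_scaled = (train - col_min) / col_range
test_scaled = (test - col_min) / col_range   # reuse the train statistics

print(train_scaled)
print(test_scaled)   # can fall outside [0, 1] when test exceeds the train range
```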

In [None]:
from sklearn.preprocessing import MinMaxScaler

X_scaler = MinMaxScaler()
X_scaler.fit(X_train)
X_train_scaled = pd.DataFrame(X_scaler.transform(X_train), columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_scaler.transform(X_test), columns=X_test.columns, index=X_test.index)

Y_scaler = MinMaxScaler()
Y_scaler.fit(Y_train[Y_train!=0])
Y_train_scaled = pd.DataFrame(Y_scaler.transform(Y_train), columns=Y_train.columns, index=Y_train.index)

sns.distplot(Y_train_scaled)
plt.show()

## Define GPLearn functions

Given the multimodal nature of the features, it might be a good idea to somehow split them into unimodal components.
Apart from the basic arithmetic operations provided by GPLearn, the following additional functions can be useful:
* Compare one feature to another and return 0 or 1

This makes it possible to produce more complex relations, such as cropping a feature from one or both sides or selecting one feature or another based on a condition, which would otherwise require introducing ternary or higher-arity operations.
Ternary operations may be supported by the engine, but they would make the expression tree very complex and slow to evolve.

*Note: the x10 and x0.1 multipliers are given as examples and will not be used in the solution.*
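
As an illustration of how binary comparisons can stand in for a ternary "if": the sum `less(a, b)*x + less(b, a)*y` selects `x` where `a < b` and `y` where `b < a`, and a product of two comparisons keeps a feature only inside a band. A minimal NumPy sketch with made-up values; the local `less` mirrors the `_less` function defined in the cell below.

```python
import numpy as np

def less(x1, x2):
    # Same behaviour as the notebook's _less: 1 where x1 < x2, else 0
    return 1 * (x1 < x2)

a = np.array([0.1, 0.9, 0.4])
b = np.array([0.5, 0.5, 0.5])
x = np.array([10.0, 20.0, 30.0])
y = np.array([-1.0, -2.0, -3.0])

# "if a < b then x else y" built from two comparisons, mul and add
# (rows where a == b would yield 0)
selected = less(a, b) * x + less(b, a) * y
print(selected)

# Keep a feature only inside the open interval (0.25, 0.75), zero elsewhere
band = a * less(0.25, a) * less(a, 0.75)
print(band)
```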

In [None]:
from gplearn.functions import make_function
from joblib import wrap_non_picklable_objects

@wrap_non_picklable_objects
def _less(x1, x2):
    return 1*(x1 < x2)

@wrap_non_picklable_objects
def _crop_less(x1, x2):
    return x1*(x1 < x2)

@wrap_non_picklable_objects
def _crop_more(x1, x2):
    return x1*(x1 > x2)

@wrap_non_picklable_objects
def _x10(x1):
    return 10*x1

@wrap_non_picklable_objects
def _x01(x1):
    return 0.1*x1

less = make_function(function=_less, name='less', arity=2)
crop_less = make_function(function=_crop_less, name='crop_less', arity=2)
crop_more = make_function(function=_crop_more, name='crop_more', arity=2)
tanh = make_function(function=np.tanh, name='tanh', arity=1)
sqr = make_function(function=np.square, name='sqr', arity=1)
x10 = make_function(function=_x10, name='x10', arity=1)
x01 = make_function(function=_x01, name='x01', arity=1)
function_set=('add', 'mul', 'neg', 'inv', less, crop_less, crop_more, 'sqrt', sqr, 'log', 'cos', 'sin', tanh) #'sub', 'div', 'min', 'max', 'abs', 'tan')

## Function visualization

Let's see how our functions change the initial feature.

In [None]:
f, ax = plt.subplots(nrows=2, ncols=3, figsize=(18, 6))
for r in range(2):
    for c in range(3):
        ax[r, c].set(ylim=(0, 4))
        ax[r, c].set(xlim=(0, 1))

sns.distplot(X_train['cont1'], ax=ax[0, 0])
sns.distplot(np.tanh(X_train['cont1']), ax=ax[0, 1], axlabel='tanh')
sns.distplot(np.square(X_train['cont1']), ax=ax[0, 2], axlabel='square')
sns.distplot(X_train['cont1']*_less(0.25, X_train['cont1']) * _less(X_train['cont1'], 0.75), ax=ax[1, 0], axlabel='(0.25,0.75)')
sns.distplot(_crop_less(X_train['cont1'], 0.5), ax=ax[1, 1], axlabel='<0.5')
sns.distplot(_crop_more(X_train['cont1'], 0.5), ax=ax[1, 2], axlabel='>0.5')

plt.tight_layout()
plt.show()

## Model

Hyperparameter selection is an interesting and quite challenging task. Feel free to play with the values after reading the official documentation.

I significantly reduced the crossover probability, increased the mutation probabilities and kept the tournament size low. This is because I want evolution to keep trying different random changes instead of converging quickly to a local, suboptimal solution.
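
One detail worth keeping in mind when tuning: in gplearn the four `p_*` probabilities must sum to at most 1, and whatever remains is the chance that a tournament winner is copied into the next generation unchanged (reproduction). A quick check of the values used in the cell below:

```python
# Genetic operation probabilities as passed to the models in this notebook
p_ops = {
    'p_crossover': 0.2,
    'p_subtree_mutation': 0.3,
    'p_hoist_mutation': 0.1,
    'p_point_mutation': 0.2,
}

total = sum(p_ops.values())
assert total <= 1.0, 'gplearn rejects operation probabilities that sum above 1'
p_reproduction = 1.0 - total   # share of offspring cloned without modification

print(f'sum of operations: {total:.2f}, reproduction: {p_reproduction:.2f}')
```

Raising any mutation probability therefore comes at the expense of either crossover or plain reproduction.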

In [None]:
import os
import pickle
from sklearn.metrics import mean_squared_error
from gplearn.genetic import SymbolicRegressor, SymbolicClassifier

def train_model(X=X_train, X_test=X_test, Y=Y_train['target'], model_name='gpl', verbose=False, classifier=False):
    if verbose: print(f'Starting training of {model_name} model')
    # Load previous execution results to continue training
    if os.path.isfile(f'{model_name}_model.pkl'):
        with open(f'{model_name}_model.pkl', 'rb') as f:
            model = pickle.load(f)
        model.set_params(generations=len(model._programs)+25, warm_start=True)
    else:
        # or initialize model
        if classifier:
            model = SymbolicClassifier(population_size=1000, generations=25, random_state=17, verbose=1, low_memory=True,
                       p_crossover=0.2, p_subtree_mutation=0.3, p_hoist_mutation=0.1, p_point_mutation=0.2,
                       parsimony_coefficient=0.000005, max_samples=1, tournament_size = 25, n_jobs=-1,
                       init_depth=(5, 10), init_method='full', const_range=(0.,1.), function_set=function_set)
        else:
            model = SymbolicRegressor(population_size=1000, generations=25, random_state=17, verbose=1, low_memory=True,
                           p_crossover=0.2, p_subtree_mutation=0.3, p_hoist_mutation=0.1, p_point_mutation=0.2,
                           parsimony_coefficient=0.000005, max_samples=1, tournament_size = 25, n_jobs=-1,
                           init_depth=(5, 10), init_method='full', const_range=(0.,1.), function_set=function_set, metric='rmse')
    # Train/predict
    model.fit(X, Y)
    if verbose: print(model)
    P_train = model.predict(X_train)
    P_test = model.predict(X_test)
    if verbose: print('Train score: {0:.4f}.'.format(mean_squared_error(P_train, Y if classifier else Y_train['target'], squared=False)))

    # Save results and model
    pd.DataFrame(P_train, columns=['target'], index=X_train.index).to_csv(f'{model_name}_train_predictions.csv', index=True)
    pd.DataFrame(P_test, columns=['target'], index=X_test.index).to_csv(f'{model_name}_test_predictions.csv', index=True)
    with open(f'{model_name}_model.pkl', 'wb') as f:
        pickle.dump(model, f)

# Direct prediction

First, let's try to predict the target directly.

In [None]:
DURATION = 4 * 3600

train_model(X=X_train, X_test=X_test, Y=Y_train['target'], model_name='gpl', verbose=True)

while time.time() - start_time < DURATION:
    train_model(X=X_train, X_test=X_test, Y=Y_train['target'], model_name='gpl', verbose=True)

## Visualize program

We can visualize the resulting program with the graphviz tool.

In [None]:
import graphviz
from IPython.display import SVG

def draw_model(model_name):
    with open(f'{model_name}_model.pkl', 'rb') as f:
        model = pickle.load(f)
    dot_data = model._program.export_graphviz()
    graph = graphviz.Source(dot_data)
    display(SVG(graph.pipe(format='svg')))

draw_model('gpl')

# Split Data by target

As suggested in [Handling Multimodal Distributions & FE Techniques](https://www.kaggle.com/iamleonie/handling-multimodal-distributions-fe-techniques), the target data is likely bimodal.

Splitting the training data with a Gaussian mixture and predicting the target separately for each class can give more accurate prediction results.

In [None]:
gmm = GaussianMixture(n_components=2, random_state=17)
gmm.fit(Y_train)
Y_class = gmm.predict(Y_train)
Y_train0 = Y_train[Y_class == 0]
Y_train1 = Y_train[Y_class == 1]

X_train0 = X_train[Y_class == 0]
X_train1 = X_train[Y_class == 1]

f, ax = plt.subplots(nrows=1, ncols=1)

sns.distplot(Y_train0, ax=ax)
sns.distplot(Y_train1, ax=ax)

plt.show()

## Is there a correlation between target class and features?

Let's find out whether any correlation exists between the target class we have just defined and the feature distributions.

This should answer the question of whether the target class can be inferred from any single feature.

In [None]:
f, ax = plt.subplots(nrows=2, ncols=7, figsize=(21, 6))

for r in range(2):
    for c in range(7):
        n = 7*r+c
        ax[r, c].set(ylim=(0, 4+2*n_classes[n]))
        sns.distplot(X_train0[f'cont{n+1}'], ax=ax[r, c])
        sns.distplot(X_train1[f'cont{n+1}'], ax=ax[r, c])

plt.tight_layout()
plt.show()

## No correlation?

From the distributions above it looks like, even if a correlation exists, it is minimal for any single feature.

There is still hope that some combination of features correlates better.

The only single feature that might have a distinct distribution is 'cont2', as it looks like a partially discrete feature. Let's visualize it separately.

In [None]:
f, ax = plt.subplots(nrows=1, ncols=1)

sns.distplot(X_train0[f'cont2'], ax=ax)
sns.distplot(X_train1[f'cont2'], ax=ax)

plt.show()

## Prediction by class

Now we can start training 3 independent models:
* A classifier that predicts the target class (which of the 2 distributions each sample belongs to)
* For each of the 2 classes, a regressor that predicts the target value

Let's first see if data split by target class can be predicted more precisely.

In [None]:
train_model(X=X_train0, X_test=X_test, Y=Y_train0['target'], model_name='gpl0', verbose=True)
train_model(X=X_train1, X_test=X_test, Y=Y_train1['target'], model_name='gpl1', verbose=True)

draw_model('gpl0')
draw_model('gpl1')

## Prediction of a target class

Nice! It looks like each subclass of the data can be predicted quite precisely.

So can we also predict the target class itself? To do so, we will use the `SymbolicClassifier` version of the model defined above.

The classifier uses the log-loss metric, which for binary classification has to be lower than `-ln(0.5)=0.69314718056` (the loss of always predicting probability 0.5) in order to be useful.
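
To make that threshold concrete, here is a minimal NumPy sketch of the constant-0.5 baseline. As a caveat, when the two classes are not equally frequent, always predicting the class frequency already scores below `ln(2)`, so that stricter baseline may be the fairer comparison. The labels are synthetic, and `log_loss` here is a local helper, not the scikit-learn function.

```python
import numpy as np

def log_loss(y_true, p):
    """Mean binary cross-entropy for labels in {0, 1} and P(class 1) = p."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), 1e-15, 1 - 1e-15)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

y = np.array([1] * 30 + [0] * 70)   # synthetic labels, 30% positives

coin_flip = log_loss(y, np.full(y.size, 0.5))     # equals ln(2)
prior = log_loss(y, np.full(y.size, y.mean()))    # entropy of the class prior

print(coin_flip, np.log(2))
print(prior)   # lower than ln(2) because the classes are imbalanced
```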

In [None]:
DURATION = 8 * 3600

train_model(X=X_train, X_test=X_test, Y=Y_class, model_name='gplc', verbose=True, classifier=True)

while time.time() - start_time < DURATION:
    train_model(X=X_train, X_test=X_test, Y=Y_class, model_name='gplc', verbose=True, classifier=True)

draw_model('gplc')

## Verify combined prediction

The program trained for each target class has made a prediction for every sample.
The resulting target can be built by choosing, for each sample, the subclass prediction that matches the predicted target class.

Let's calculate the final training score.

In [None]:
P_train_class = pd.read_csv('gplc_train_predictions.csv', index_col='id', dtype=np.float64)
P_train0 = pd.read_csv('gpl0_train_predictions.csv', index_col='id', dtype=np.float64)
P_train1 = pd.read_csv('gpl1_train_predictions.csv', index_col='id', dtype=np.float64)
P_train = P_train0
P_train['target'] = np.where(P_train_class['target'] == 0, P_train0['target'], P_train1['target'])
print('Train score: {0:.4f}.'.format(mean_squared_error(P_train['target'], Y_train['target'], squared=False)))

## Submit combined predictions

Due to poor target class prediction, the resulting train prediction score is quite bad. Because the train and test rows are probably sampled from the same dataset, we can expect a similar score from the final submission.

The final prediction for the test data is produced in the same way.

In [None]:
P_test_class = pd.read_csv('gplc_test_predictions.csv', index_col='id', dtype=np.float64)
P_test0 = pd.read_csv('gpl0_test_predictions.csv', index_col='id', dtype=np.float64)
P_test1 = pd.read_csv('gpl1_test_predictions.csv', index_col='id', dtype=np.float64)
P_test = P_test0
P_test['target'] = np.where(P_test_class['target'] == 0, P_test0['target'], P_test1['target'])
P_test.to_csv("submission_c01.csv", index=True)
print(P_test)