## Introduction

The Spaceship Titanic is a Machine Learning competition currently running on Kaggle with the participation of more than 2,000 teams.

As the final project in this module, we will run a small competition on a smaller dataset, and you will need to use Genetic Programming (GP) or Grammatical Evolution (GE) to evolve a solution.

## Description

Welcome to the year 2912, where your data science skills are needed to solve a cosmic mystery. We've received a transmission from four lightyears away and things aren't looking good.

The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars.

While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!

To help rescue crews and retrieve the lost passengers, you are challenged to predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system.

Help save them and change history!

In [1]:
# Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Dataset

The dataset provides 12 input variables that are a mixture of categorical, ordinal, boolean and numerical data types:

1. PassengerId
2. HomePlanet
3. CryoSleep
4. Destination
5. Age
6. VIP
7. RoomService
8. FoodCourt
9. ShoppingMall
10. Spa
11. VRDeck
12. Name


This is a binary classification problem where the task is to predict whether a passenger was transported to an alternate dimension. 

### Load the dataset

In [2]:
# Suppressing Warnings:
import warnings
warnings.filterwarnings("ignore")

In [3]:
## mount your Google drive
# 1) click on the link
# 2) sign in
# 3) copy the provided code
# 4) paste it in the text box below
# 5) click the folder icon at the right
# 6) verify your drive is mounted

from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


First, clone the GRAPE repository, because it already contains the dataset used here.

In [4]:
import os
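# Get the library from our BDS research group.
# PATH must be the filesystem path of the clone inside the mounted drive
# (for example /content/drive/MyDrive/grape). A Drive folder URL fragment like the
# one below never exists on disk, so the check fails and git clone is attempted again.
PATH = '/content/drive/u/1/folders/1--2pl_taR8wZpcFAvbQNWaQen1afSZQF'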

# check if 'grape' already exists
if os.path.exists(PATH):
    print('grape directory already exists')
else:
    %cd /content/drive/MyDrive/
    !git clone https://github.com/UL-BDS/grape.git 
    print('Cloning grape in your Drive')

# change directory to 'grape'
%cd /content/drive/MyDrive/grape/

/content/drive/MyDrive
fatal: destination path 'grape' already exists and is not an empty directory.
Cloning grape in your Drive
/content/drive/MyDrive/grape


### Train set

In [5]:
train_file = 'datasets/spaceshipTitanic_train.csv'

In [6]:
# load train set
import pandas as pd
df_train = pd.read_csv('/content/drive/MyDrive/grape/datasets/spaceshipTitanic_train.csv')
df_train.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0,Earth,False,55 Cancri e,22,False,0,833,381,0,12,Miranda Pratt,True
1,1,Mars,True,TRAPPIST-1e,61,False,0,0,0,0,0,Isaac Werner,True
2,2,Mars,True,TRAPPIST-1e,5,False,0,0,0,0,0,Elisha Rosario,True
3,3,Earth,False,55 Cancri e,14,False,653,0,4,0,0,Deshawn Hall,False
4,4,Earth,False,PSO J318.5-22,2,False,0,0,0,0,0,Justice Archer,True


In [7]:
df_train.describe()

Unnamed: 0,PassengerId,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
count,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0
mean,999.5,28.5555,213.4605,497.9025,166.237,342.252,269.211
std,577.494589,14.629112,615.762402,1763.257082,509.568841,1236.474773,1021.074852
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,499.75,20.0,0.0,0.0,0.0,0.0,0.0
50%,999.5,26.0,0.0,0.0,0.0,0.0,0.0
75%,1499.25,37.0,32.0,61.25,23.0,67.25,37.0
max,1999.0,79.0,6899.0,27723.0,10424.0,18572.0,14485.0


In [8]:
X_train = df_train.copy()
# warning: cannot drop it more than once
X_train.drop(['Transported'], axis=1, inplace=True)

In [9]:
# class labels as a boolean NumPy array
import numpy as np

y_train = df_train['Transported'].to_numpy(dtype=bool)

In [10]:
#y_train.head()
print(y_train[0:5])

[ True  True  True False  True]


### Test set

In [11]:
test_file = 'datasets/spaceshipTitanic_test.csv'

In [12]:
# load test set
df_test = pd.read_csv('/content/drive/MyDrive/grape/datasets/spaceshipTitanic_test.csv')
df_test.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name
0,2000,Mars,False,TRAPPIST-1e,54,False,676,0,231,379,0,Dawson Knox
1,2001,Mars,False,TRAPPIST-1e,43,False,336,11,796,15,0,Jaylee Navarro
2,2002,Europa,False,55 Cancri e,33,False,77,2381,0,3656,150,Dario Hart
3,2003,Earth,True,55 Cancri e,30,False,0,0,0,0,0,Alden Parker
4,2004,Europa,False,TRAPPIST-1e,31,False,0,53,0,2963,1017,Gina Frank


In [13]:
df_test.describe()

Unnamed: 0,PassengerId,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
count,4923.0,4923.0,4923.0,4923.0,4923.0,4923.0,4923.0
mean,4461.0,29.028235,231.206175,473.335568,184.646354,308.407678,317.807434
std,1421.292018,14.466997,696.138873,1634.705363,677.528376,1126.346091,1164.989135
min,2000.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,3230.5,19.0,0.0,0.0,0.0,0.0,0.0
50%,4461.0,27.0,0.0,0.0,0.0,0.0,0.0
75%,5691.5,38.0,59.0,84.0,31.0,63.0,59.5
max,6922.0,79.0,14327.0,29813.0,23492.0,22408.0,20336.0


In [14]:
df_test

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name
0,2000,Mars,False,TRAPPIST-1e,54,False,676,0,231,379,0,Dawson Knox
1,2001,Mars,False,TRAPPIST-1e,43,False,336,11,796,15,0,Jaylee Navarro
2,2002,Europa,False,55 Cancri e,33,False,77,2381,0,3656,150,Dario Hart
3,2003,Earth,True,55 Cancri e,30,False,0,0,0,0,0,Alden Parker
4,2004,Europa,False,TRAPPIST-1e,31,False,0,53,0,2963,1017,Gina Frank
...,...,...,...,...,...,...,...,...,...,...,...,...
4918,6918,Earth,True,PSO J318.5-22,46,False,0,0,0,0,0,Lamar Knight
4919,6919,Europa,True,TRAPPIST-1e,15,False,0,0,0,0,0,Yadira Pittman
4920,6920,Earth,False,55 Cancri e,20,False,0,0,0,335,957,Maliyah Morgan
4921,6921,Earth,False,PSO J318.5-22,42,False,0,168,0,113,461,Brynlee Gilbert


In [15]:
X_test = df_test.copy()

We need to prepare both training and test datasets before working with a Machine Learning method. 

Consider the following tips:

1.   Remove columns that you think do not influence the class label (for example 'PassengerId');
2.   Use some encoding method with categorical data.

You are free to use any other pre-processing ideas. 

For instance, you could use one-hot encoding for categorical data, as shown when we studied the heart disease dataset.


Number of categories in each categorical variable:



1.   HomePlanet: 3
2.   Destination: 3
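
As a minimal sketch of tip 2, the snippet below one-hot encodes a toy frame with `pd.get_dummies`. The rows and the name `toy` are made up for illustration; the category values are the ones that appear in the tables above.

```python
import pandas as pd

# Toy rows (illustrative only) using category values seen in the dataset.
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Mars", "Europa", "Earth"],
    "Destination": ["55 Cancri e", "TRAPPIST-1e", "PSO J318.5-22", "TRAPPIST-1e"],
})

# Each category becomes its own 0/1 column: 3 + 3 = 6 columns in total.
encoded = pd.get_dummies(toy, columns=["HomePlanet", "Destination"], dtype=int)
print(encoded.columns.tolist())
```

Every row has exactly one 1 per original variable, so an evolved expression can treat each of these columns as a boolean feature.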



In [16]:
X_train1 = X_train[['HomePlanet','CryoSleep','Destination','VIP','Age','RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']].copy()

In [17]:
X_train1['Total Spending']=X_train1['RoomService']+X_train1['FoodCourt']+X_train1['ShoppingMall']+X_train1['Spa']+X_train1['VRDeck']

In [18]:
X_train2=X_train1[['HomePlanet','CryoSleep','Destination','VIP','Age','RoomService','FoodCourt','ShoppingMall','Spa','VRDeck','Total Spending']].copy()

In [19]:
X_train2

Unnamed: 0,HomePlanet,CryoSleep,Destination,VIP,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Total Spending
0,Earth,False,55 Cancri e,False,22,0,833,381,0,12,1226
1,Mars,True,TRAPPIST-1e,False,61,0,0,0,0,0,0
2,Mars,True,TRAPPIST-1e,False,5,0,0,0,0,0,0
3,Earth,False,55 Cancri e,False,14,653,0,4,0,0,657
4,Earth,False,PSO J318.5-22,False,2,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...
1995,Earth,False,TRAPPIST-1e,False,27,158,0,1271,0,71,1500
1996,Mars,False,55 Cancri e,False,51,40,716,1907,0,0,2663
1997,Mars,False,TRAPPIST-1e,False,32,689,0,610,0,0,1299
1998,Mars,True,TRAPPIST-1e,False,41,0,0,0,0,0,0


In [21]:
X_test1 = X_test[['HomePlanet','CryoSleep','VIP','Destination','Age','RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']].copy()
X_test1['Total Spending']=X_test1['RoomService']+X_test1['FoodCourt']+X_test1['ShoppingMall']+X_test1['Spa']+X_test1['VRDeck']
X_test2=X_test1[['HomePlanet','CryoSleep','Destination','VIP','Age','RoomService','FoodCourt','ShoppingMall','Spa','VRDeck','Total Spending']].copy() #making new feature 

### Preprocessing pipeline


In [23]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

In [24]:
# Pipeline for numerical attributes
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),                              # fill missing values with the column median
    ("scaler", StandardScaler())                                                # standardise to zero mean and unit variance
])

In [25]:
# Pipeline for categorical attributes

cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),                       # fill missing values with the most frequent category
    # one-hot encode the categories; scikit-learn >= 1.2 renames `sparse` to `sparse_output`
    ("cat_encoder", OneHotEncoder(sparse=False))
])

In [26]:
# Combine the numerical and categorical pipelines

num_attribs = ["Age","RoomService","FoodCourt","ShoppingMall","Spa","VRDeck"]   # numerical features ('Total Spending' is not listed, so it is not passed on)
cat_attribs = ["CryoSleep", "VIP","HomePlanet", "Destination"]                  # categorical features

preprocess_pipeline = ColumnTransformer([                                       # scale the numerical columns and one-hot encode the categorical ones
    ("num", num_pipeline, num_attribs),
    ("cat", cat_pipeline, cat_attribs)
])

In [27]:
X_train2 = preprocess_pipeline.fit_transform(
    X_train2[num_attribs + cat_attribs]
)
X_train2[0:10]

array([[-0.44822539, -0.34674719,  0.19009214,  0.42156563, -0.27686581,
        -0.2519652 ,  1.        ,  0.        ,  1.        ,  0.        ,
         1.        ,  0.        ,  0.        ,  1.        ,  0.        ,
         0.        ],
       [ 2.21835841, -0.34674719, -0.2824472 , -0.32631229, -0.27686581,
        -0.26372046,  0.        ,  1.        ,  1.        ,  0.        ,
         0.        ,  0.        ,  1.        ,  0.        ,  0.        ,
         1.        ],
       [-1.61058243, -0.34674719, -0.2824472 , -0.32631229, -0.27686581,
        -0.26372046,  0.        ,  1.        ,  1.        ,  0.        ,
         0.        ,  0.        ,  1.        ,  0.        ,  0.        ,
         1.        ],
       [-0.99521693,  0.713992  , -0.2824472 , -0.31846055, -0.27686581,
        -0.26372046,  1.        ,  0.        ,  1.        ,  0.        ,
         1.        ,  0.        ,  0.        ,  1.        ,  0.        ,
         0.        ],
       [-1.81570426, -0.34674719, -0

In [28]:
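# note: fit_transform re-fits the imputers, scaler and encoder on the test set;
# preprocess_pipeline.transform(...) would reuse the statistics learned from the training set
X_test2 = preprocess_pipeline.fit_transform(X_test2[num_attribs + cat_attribs])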
X_test2[0:10]

array([[ 1.72629481,  0.639009  , -0.28958347,  0.06842275,  0.06268011,
        -0.27282634,  1.        ,  0.        ,  1.        ,  0.        ,
         0.        ,  0.        ,  1.        ,  0.        ,  0.        ,
         1.        ],
       [ 0.96586627,  0.1505511 , -0.28285374,  0.90242087, -0.2605216 ,
        -0.27282634,  1.        ,  0.        ,  1.        ,  0.        ,
         0.        ,  0.        ,  1.        ,  0.        ,  0.        ,
         1.        ],
       [ 0.2745676 , -0.2215389 ,  1.16709607, -0.27255701,  2.97238335,
        -0.14405669,  1.        ,  0.        ,  1.        ,  0.        ,
         0.        ,  1.        ,  0.        ,  1.        ,  0.        ,
         0.        ],
       [ 0.067178  , -0.33216025, -0.28958347, -0.27255701, -0.27384035,
        -0.27282634,  0.        ,  1.        ,  1.        ,  0.        ,
         1.        ,  0.        ,  0.        ,  1.        ,  0.        ,
         0.        ],
       [ 0.13630787, -0.33216025, -0
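
One caveat about the two preprocessing cells above: calling `fit_transform` on the test set re-estimates the medians, means and standard deviations from the test data, so the two sets end up scaled with different statistics. A common alternative is to fit once on the training set and only call `transform` on the test set. The sketch below shows the difference on made-up numbers; `toy_train` and `toy_test` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

toy_train = np.array([[0.0], [10.0], [20.0]])   # mean 10
toy_test = np.array([[10.0], [30.0]])           # mean 20

scaler = StandardScaler().fit(toy_train)        # statistics from the training data only
test_shared = scaler.transform(toy_test)        # reuses the training mean and std

test_refit = StandardScaler().fit_transform(toy_test)   # re-fits on the test data

# The raw value 10.0 maps to 0 under the training statistics,
# but to a negative value when the scaler is re-fitted on the test set.
print(test_shared[0, 0], test_refit[0, 0])
```

With shared statistics, a threshold learned by an evolved expression on the training set means the same thing on the test set.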

## GRAPE

<div>
<img src="https://drive.google.com/uc?export=view&id=1hw43Oi3lGTCkspQ0ged2bZB8q2EpcPhz" width="150"/>
</div> 

GRammatical Algorithms in Python for Evolution (GRAPE)


In [30]:
!pip install deap==1.3 

import grape
import algorithms

from os import path
from deap import creator, base, tools
import random
import csv

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


You can import functions to be used with your grammar from [functions.py](https://github.com/UL-BDS/grape/blob/main/functions.py) in the GRAPE repository, and/or you can define your own functions.
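
Evolved expressions can divide by zero or take the logarithm of a non-positive number, which is why GP/GE function sets usually rely on "protected" operators. The sketch below shows one common way to write a protected division with NumPy. It is an illustration under that assumption, not the code in `functions.py`, and the name `protected_div` is hypothetical.

```python
import numpy as np

def protected_div(a, b):
    """Element-wise a / b that returns 1.0 wherever b is zero (one common convention)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a, b = np.broadcast_arrays(a, b)
    out = np.ones(a.shape, dtype=float)
    # Divide only where the denominator is non-zero; other entries keep the default 1.0.
    np.divide(a, b, out=out, where=(b != 0))
    return out

print(protected_div([1.0, 4.0, 3.0], [2.0, 0.0, 3.0]))
```

Any function you define yourself should be similarly safe on whole arrays, because the fitness function evaluates each phenotype on all passengers at once.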

In [31]:
from functions import add, sub, mul, pdiv, psqrt, plog, neg, and_, or_, not_, less_than_or_equal, greater_than_or_equal,nand_,nor_

In [32]:
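# Extra functions that can be referenced from the grammar file
def pexp(a):
  # note: despite its name, this returns the element-wise square, not the exponential
  return np.square(a)

def pneg(a):
  # element-wise negation
  return np.multiply(a,-1)

def phalf(a):
  # element-wise halving
  return np.multiply(a,0.5)

def psin(a):
  return np.sin(a)

def pcos(a):
  return np.cos(a)

def ptan(a):
  return np.tan(a)

def xor_(a, b):
    return np.logical_xor(a,b)
    
def mean_(a):
    # mean over all elements of a; returns a single scalar, not one value per passenger
    return np.mean(a)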

In [34]:
GRAMMAR_FILE = 'SpaceShip_10_feature_final.bnf'                                 # grammar file inside the grammars folder
with open("grammars/" + GRAMMAR_FILE, "r") as f:
    print(f.read())                                                             # print the grammar


<log_op> ::= <conditional_branches> | and_(<log_op>,<log_op>) | or_(<log_op>,<log_op>) | not_(<log_op>) | <boolean_feature> | nand_(<log_op>,<log_op>) | nor_(<log_op>,<log_op>) | xor_(<log_op>,<log_op>) | not_(<boolean_feature>) 
<c>  ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<o> ::= +|-|*
<conditional_branches> ::= less_than_or_equal(<num_op>,<num_op>) | greater_than_or_equal(<num_op>, <num_op>)
<num_op>   ::= add(<num_op>,<num_op>) | sub(<num_op>,<num_op>) | mul(<num_op>,<num_op>) | pdiv(<num_op>,<num_op>) | <nonboolean_feature> | <nonboolean_feature> <o> <nonboolean_feature> | <nonboolean_feature> | <nonboolean_feature> <o> <nonboolean_feature>
<boolean_feature> ::= x[6]|x[7]|x[8]|x[9]|x[10]|x[11]|x[12]|x[13]|x[14]|x[15]
<nonboolean_feature> ::= x[0]|x[1]|x[2]|x[3]|x[4]|x[5]|<c><c>.<c><c>|mean_(<nonboolean_feature>)
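
To make the role of the grammar concrete: in Grammatical Evolution a genome is a list of integer codons, and each codon chooses a production for the leftmost non-terminal, typically via `codon % number_of_choices`. The sketch below applies that rule to a tiny grammar loosely modelled on the one above. It is a simplified illustration, not GRAPE's mapper (it ignores wrapping, depth limits and the codon-consumption modes), and `TOY_GRAMMAR` and `map_genome` are hypothetical names.

```python
import re

# Tiny grammar: non-terminal -> list of alternative productions.
TOY_GRAMMAR = {
    "<log_op>": ["and_(<log_op>,<log_op>)", "not_(<log_op>)", "<bool>"],
    "<bool>": ["x[6]", "x[7]"],
}
NON_TERMINAL = re.compile(r"<[a-z_]+>")

def map_genome(genome, start="<log_op>"):
    """Expand the leftmost non-terminal with production (codon % n_choices)."""
    phenotype = start
    for codon in genome:
        match = NON_TERMINAL.search(phenotype)
        if match is None:          # fully mapped: only terminals remain
            return phenotype
        choices = TOY_GRAMMAR[match.group()]
        chosen = choices[codon % len(choices)]
        phenotype = phenotype[:match.start()] + chosen + phenotype[match.end():]
    # Codons ran out: the result is valid only if no non-terminal is left.
    return phenotype if NON_TERMINAL.search(phenotype) is None else None

print(map_genome([1, 2, 0]))
```

A genome that runs out of codons before all non-terminals are expanded yields `None` here, which is analogous to an individual being counted as invalid.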


Run the following cell to load your grammar into GRAPE's `Grammar` class.

In [35]:
BNF_GRAMMAR = grape.Grammar(path.join("grammars", GRAMMAR_FILE))                # parse the BNF file into a Grammar object

In [36]:
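def fitness_eval(individual, points):
    """
    Fitness function: misclassification rate (1 - accuracy) on the given points.
    points[0] holds the features with shape (n_features, n_samples);
    points[1] holds the boolean class labels.
    """

    x = points[0]   # referenced as x[i] inside the evolved phenotype
    Y = points[1]
    
    # Invalid individuals get NaN fitness
    if individual.invalid:
        return np.nan,

    # Evaluate the expression
    try:
        pred = eval(individual.phenotype)
    except (FloatingPointError, ZeroDivisionError, OverflowError,
            MemoryError):
        return np.nan,
    assert np.isrealobj(pred)

    # Fraction of passengers whose prediction differs from the label
    compare = np.equal(Y,pred)
    fitness = 1 - np.mean(compare)
   
    return fitness,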

In [37]:
X_train2 = np.transpose(X_train2)                                               # transpose to (n_features, n_samples) so that x[i] in a phenotype selects feature i
X_test2 = np.transpose(X_test2)

print('Training (X,Y):\t', X_train2.shape, y_train.shape)
print('Test (X):\t', X_test2.shape)

Training (X,Y):	 (16, 2000) (2000,)
Test (X):	 (16, 4923)
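
The transpose matters because a phenotype such as `greater_than_or_equal(x[0], x[1])` indexes `x` by feature: with shape `(n_features, n_samples)`, `x[0]` is the vector of one feature across all passengers, so a single `eval` yields one prediction per passenger. A minimal sketch with made-up data follows; the local `greater_than_or_equal` is a stand-in written for this example, not the function imported from `functions.py`.

```python
import numpy as np

def greater_than_or_equal(a, b):
    # Stand-in for illustration: element-wise a >= b.
    return np.greater_equal(a, b)

# 3 passengers x 2 features, then transposed to (features, passengers).
X = np.array([[0.5, 0.1],
              [0.2, 0.9],
              [0.7, 0.7]])
x = np.transpose(X)
Y = np.array([True, False, False])

phenotype = "greater_than_or_equal(x[0], x[1])"
pred = eval(phenotype)                       # one boolean per passenger
fitness = 1 - np.mean(np.equal(Y, pred))     # misclassification rate
print(pred, fitness)
```

Here two of the three predictions match the labels, so the fitness (the error to be minimised) is one third.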


Set the Grammatical Evolution parameters.

In [38]:

POPULATION_SIZE = 10000                                                         #declaring parameters for the GE to take place
MAX_GENERATIONS = 500
P_CROSSOVER = 0.85
P_MUTATION = 0.05
ELITE_SIZE = 1
HALL_OFFAME_SIZE = 1

TOURNAMENT_SIZE = 10                                                            #using 10 for tournament size
RANDOM_SEED = 42                                                                #fixing seed to 42 as constant
random.seed(RANDOM_SEED) 

CODON_CONSUMPTION = 'eager'
GENOME_REPRESENTATION = 'list'
MAX_GENOME_LENGTH = None

MAX_INIT_TREE_DEPTH = 10
MIN_INIT_TREE_DEPTH = 5
MAX_TREE_DEPTH = 250
MAX_WRAPS = 1                                                                   #using max wrap 1 to allow wrapping 
CODON_SIZE = 255

REPORT_ITEMS = ['gen', 'invalid', 'avg', 'std', 'min', 'max', 
                'best_ind_length', 'avg_length', 
                'best_ind_nodes', 'avg_nodes', 
                'best_ind_depth', 'avg_depth', 
                'avg_used_codons', 'best_ind_used_codons',
                'structural_diversity', 'fitness_diversity',
                'selection_time', 'generation_time']

Create a toolbox.

In [39]:
toolbox = base.Toolbox()

# define a single objective, minimising fitness strategy:
creator.create("FitnessMin", base.Fitness, weights=(-1.0,))                     #setting weights to -1 to minimise the fitness score as objective
creator.create('Individual', grape.Individual, fitness=creator.FitnessMin)

toolbox.register("populationCreator", grape.sensible_initialisation, creator.Individual)  #using sensible initialisation

toolbox.register("evaluate", fitness_eval)

# Tournament selection:
toolbox.register("select", tools.selTournament, tournsize=TOURNAMENT_SIZE)      #declaring selTournament for tournament selection

# Single-point crossover:
toolbox.register("mate", grape.crossover_onepoint)

# Flip-int mutation:
toolbox.register("mutate", grape.mutation_int_flip_per_codon)
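
For intuition about the `select` operator: tournament selection repeatedly draws `tournsize` individuals at random and keeps the best of each draw, so larger tournaments increase selection pressure. The sketch below is a plain-Python illustration for a minimised fitness, using made-up `(name, fitness)` pairs. It is not DEAP's implementation, and `tournament_select` is a hypothetical helper.

```python
import random

def tournament_select(population, k, tournsize, rng):
    """Pick k winners; each is the lowest-fitness entrant of a random tournament."""
    winners = []
    for _ in range(k):
        entrants = [rng.choice(population) for _ in range(tournsize)]
        winners.append(min(entrants, key=lambda ind: ind[1]))
    return winners

rng = random.Random(42)
population = [("a", 0.40), ("b", 0.25), ("c", 0.31), ("d", 0.50)]
selected = tournament_select(population, k=5, tournsize=3, rng=rng)
print(selected)
```

With a tournament size of 10, as set in the parameters above, weak individuals rarely win a tournament, which speeds up convergence at the cost of diversity.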

In [40]:
# create initial population (generation 0):
population = toolbox.populationCreator(pop_size=POPULATION_SIZE, 
                                           bnf_grammar=BNF_GRAMMAR, 
                                           min_init_depth=MIN_INIT_TREE_DEPTH,
                                           max_init_depth=MAX_INIT_TREE_DEPTH,
                                           codon_size=CODON_SIZE,
                                           codon_consumption=CODON_CONSUMPTION,
                                           genome_representation=GENOME_REPRESENTATION
                                            )

# define the hall-of-fame object:
hof = tools.HallOfFame(HALL_OFFAME_SIZE)

# prepare the statistics object:
stats = tools.Statistics(key=lambda ind: ind.fitness.values)
stats.register("avg", np.nanmean)
stats.register("std", np.nanstd)
stats.register("min", np.nanmin)
stats.register("max", np.nanmax)


Run Grammatical Evolution.

In [41]:
population, logbook = algorithms.ge_eaSimpleWithElitism(population, toolbox, cxpb=P_CROSSOVER, mutpb=P_MUTATION,
                                              ngen=MAX_GENERATIONS, elite_size=ELITE_SIZE,
                                              bnf_grammar=BNF_GRAMMAR, 
                                              codon_size=CODON_SIZE, 
                                              max_tree_depth=MAX_TREE_DEPTH,
                                              max_genome_length=MAX_GENOME_LENGTH,
                                              points_train=[X_train2, y_train], 
                                              codon_consumption=CODON_CONSUMPTION,
                                              report_items=REPORT_ITEMS,
                                              genome_representation=GENOME_REPRESENTATION,                                              
                                              stats=stats, halloffame=hof, verbose=False)

gen = 0 , Best fitness = (0.27249999999999996,)
gen = 1 , Best fitness = (0.253,) , Number of invalids = 4404
gen = 2 , Best fitness = (0.253,) , Number of invalids = 4937
gen = 3 , Best fitness = (0.253,) , Number of invalids = 5035
gen = 4 , Best fitness = (0.253,) , Number of invalids = 5125
gen = 5 , Best fitness = (0.253,) , Number of invalids = 5135
gen = 6 , Best fitness = (0.253,) , Number of invalids = 5180
gen = 7 , Best fitness = (0.23950000000000005,) , Number of invalids = 5101
gen = 8 , Best fitness = (0.23950000000000005,) , Number of invalids = 5054
gen = 9 , Best fitness = (0.23950000000000005,) , Number of invalids = 4931
gen = 10 , Best fitness = (0.23950000000000005,) , Number of invalids = 4830
gen = 11 , Best fitness = (0.23950000000000005,) , Number of invalids = 4755
gen = 12 , Best fitness = (0.23950000000000005,) , Number of invalids = 4662
gen = 13 , Best fitness = (0.23950000000000005,) , Number of invalids = 4647
gen = 14 , Best fitness = (0.239500000000000

Show the best individual as an expression.

In [42]:
# Best individual
import textwrap
best = hof.items[0].phenotype
print("Best individual: \n","\n".join(textwrap.wrap(best,80)))
print("\nTraining Fitness: ", hof.items[0].fitness.values[0])
print("Depth: ", hof.items[0].depth)
print("Length of the genome: ", len(hof.items[0].genome))
print(f'Used portion of the genome: {hof.items[0].used_codons/len(hof.items[0].genome):.2f}')

Best individual: 
 greater_than_or_equal(add(x[5] * x[1],x[2]), add(mul(pdiv(mean_(x[3]) *
x[4],sub(mul(mul(add(mul(x[1],sub(x[1],31.11 - 42.68)),x[3]),sub(x[2] +
x[4],mul(x[4] - x[3],mean_(x[2])))),x[4]),x[5] + x[4])),sub(sub(x[3] -
x[4],mean_(x[3])),x[0] - x[2])),x[4]))

Training Fitness:  0.23950000000000005
Depth:  15
Length of the genome:  223
Used portion of the genome: 0.34


Define a function to predict values.

In [43]:
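def predict(individual, X):
    """Evaluate the individual's phenotype on X (n_features, n_samples) and return a list of booleans."""
    x = X   # referenced as x[i] inside the evolved phenotype
    
    if individual.invalid:
        return np.nan,

    # Evaluate the expression
    try:
        pred = eval(individual.phenotype)
    except (FloatingPointError, ZeroDivisionError, OverflowError,
            MemoryError):
        return np.nan,
    assert np.isrealobj(pred)
    
    _, c = x.shape
    
    try:
        # a prediction is True when the phenotype output is positive (or boolean True)
        Y_class = [bool(pred[i] > 0) for i in range(c)]
    except (IndexError, TypeError):
        return np.nan,
   
    return Y_class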

Predict the classes of the test set.

In [44]:
y_pred = predict(hof.items[0], X_test2)
print("Predicted classes of the test set: ", y_pred)                            #check the predictions and printing

Predicted classes of the test set:  [False, False, False, True, False, True, True, False, True, True, True, True, True, False, True, False, False, True, True, True, True, False, False, True, True, True, True, True, True, True, True, True, False, False, False, True, True, True, True, True, True, True, True, False, True, True, False, True, True, False, True, True, True, True, False, False, True, True, True, False, True, True, False, True, True, True, True, False, False, True, True, True, False, True, False, False, False, True, True, False, False, False, False, True, True, False, False, False, False, False, False, True, True, True, True, True, True, False, True, True, False, True, True, False, False, True, False, False, False, True, True, True, False, True, False, True, False, False, True, False, False, False, True, False, True, True, True, True, True, True, True, False, True, True, True, True, True, False, True, False, False, False, True, True, True, True, True, True, False, True, False,

Save the predictions in a .csv file and submit it to the Kaggle competition.

The format is as follows:
1. First column is the original `PassengerId` column in the test set;
2. Second column is named `Transported` and contains the predictions (only 0's or 1's).
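
The two-column layout described above can be sketched as follows. The IDs and predictions are made up, and the result is written to a string instead of a file only so that it can be inspected. The encoding of the prediction column (booleans here) should be adapted to whatever the competition page requires.

```python
import pandas as pd

passenger_ids = [2000, 2001, 2002]       # made-up IDs
predictions = [False, True, True]        # made-up predictions

submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Transported": predictions,
})

# index=False keeps the row index out of the output, leaving exactly two columns.
csv_text = submission.to_csv(index=False)
print(csv_text)
```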

In [45]:
# convert True/False to the strings 'TRUE'/'FALSE' for Kaggle
y_pred = [str(each).upper() for each in y_pred]

In [46]:
print(y_pred)

['FALSE', 'FALSE', 'FALSE', 'TRUE', 'FALSE', 'TRUE', 'TRUE', 'FALSE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'FALSE', 'TRUE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'FALSE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'FALSE', 'TRUE', 'TRUE', 'FALSE', 'TRUE', 'TRUE', 'FALSE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'TRUE', 'FALSE', 'TRUE', 'TRUE', 'FALSE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'TRUE', 'FALSE', 'TRUE', 'FALSE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'FALSE', 'FALSE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'FALSE', 'FALSE', 'FALSE', 'FALSE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'FALSE', 'TRUE', 'TRUE', 'FALSE', 'TRUE', 'TRUE', 'FALSE', 'FALSE', 'TRUE', 'FALSE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'TRUE', 'FALSE', 'TRUE', 'FALSE', 'TRUE', 'FALSE', 'FALSE', 'TRUE', 

In [47]:
df_id = df_test['PassengerId']
df_class = pd.DataFrame(data=y_pred, columns = ['Transported'])
df_pred = pd.concat([df_id, df_class], axis=1)

df_pred.to_csv('predictions_'+str(hof.items[0].fitness.values[0])+'.csv', sep=',', index=False)

In [48]:
from google.colab import files
files.download('predictions_'+str(hof.items[0].fitness.values[0])+'.csv')       #download the predictions with fitness score in name
