<div>
<img src="https://www.ul.ie/themes/custom/ul/logo.jpg" />
</div>

# **MSc in Artificial Intelligence and Machine Learning**
## CS6271 - Evolutionary Algorithms and Humanoid Robotics 2023
### Kaggle Competition


Module Leader: Conor Ryan

Developers:  
- Siddharth Prince - 23052058
- Pratik Verma - 23007575

## Introduction

Predict whether income exceeds $50K/yr based on census data. This is a shorter version of the Adult dataset, also known as the "Census Income" dataset (donated on 4/30/1996).

In [1]:
# Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Dataset

Class:

income: >50K, <=50K.


Listing of features:

age: continuous.

workclass: categorical (Private, Self-emp-not-inc, Local-gov, State-gov).

education: categorical (Bachelors, Some-college, HS-grad, Masters, Doctorate).

marital-status: categorical (Married-civ-spouse, Divorced, Never-married).

relationship: categorical (Wife, Husband, Not-in-family, Other-relative).

race: categorical (White, Asian-Pac-Islander, Black).

sex: categorical (Female, Male).

capital-gain: continuous.

capital-loss: continuous.

hours-per-week: continuous.

native-country: categorical (United-States, Others).


### Load the dataset

In [2]:
# Suppressing Warnings:
import warnings
warnings.filterwarnings("ignore")

In [None]:
## mount your Google drive
# 1) run this cell
# 2) sign in
# 3) verify your drive is mounted

# from google.colab import drive
# drive.mount('/content/drive', force_remount=True)

First, clone the GRAPE repository, because the dataset to be used is already there.

In [3]:
import os
# Get the library from our BDS research Group
# copy the path from your drive
PATH = './grape/'

# check if 'grape' already exists
if os.path.exists(PATH):
    print('grape directory already exists')
else:
    !git clone https://github.com/bdsul/grape.git
    print('Cloning grape in your Drive')

# change directory to 'grape'
%cd ./grape/

grape directory already exists
/home/sprince0031/UL_AI_ML/SEM_1/CS6271-Evolutionary_Computation_and_Humanoid_Robotics/Project/grape


Now you have a grape folder in your Drive account.

Upload the files adult_training.csv and adult_test.csv to the folder grape/datasets in your Drive before running the next cells.

### Train set

In [4]:
train_file = './datasets/adult_training.csv'

In [5]:
# load train set
df_train = pd.read_csv(train_file)
df_train.head()

Unnamed: 0,age,workclass,education,marital-status,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,28,Private,Bachelors,Never-married,Not-in-family,White,Male,0,0,40,United-States,<=50K
1,34,Self-emp-not-inc,Bachelors,Married-civ-spouse,Husband,Black,Male,0,1887,48,United-States,>50K
2,32,Private,Bachelors,Never-married,Not-in-family,Black,Female,0,0,40,United-States,<=50K
3,46,Private,Bachelors,Divorced,Not-in-family,White,Male,0,0,40,Others,<=50K
4,44,Private,Bachelors,Married-civ-spouse,Husband,White,Male,0,0,50,United-States,>50K


In [6]:
df_train.describe()

Unnamed: 0,age,capital-gain,capital-loss,hours-per-week
count,5200.0,5200.0,5200.0,5200.0
mean,39.688077,1059.895,109.486346,42.786538
std,11.973363,6687.36408,442.694051,10.937644
min,17.0,0.0,0.0,1.0
25%,30.0,0.0,0.0,40.0
50%,38.0,0.0,0.0,40.0
75%,48.0,0.0,0.0,48.0
max,90.0,99999.0,2559.0,99.0


In [56]:
X_train = df_train.copy()
# warning: cannot drop it more than once
X_train.drop(['income'], axis=1, inplace=True)

You should represent the outputs with 0 where the income is less than or equal to 50K and with 1 where it is greater than 50K.

Follow this approach exactly, because the test targets are represented this way in the competition.

In [57]:
# class labels
l, _ = X_train.shape

y_train = np.zeros([l,], dtype=int)

for i in range(l):
  if df_train['income'].iloc[i] == '>50K':
    y_train[i] = 1
  elif df_train['income'].iloc[i] == '<=50K':
    y_train[i] = 0

In [58]:
print(y_train[0:5]) #print head

[0 1 0 0 1]
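The same labels can also be produced without an explicit loop. The sketch below uses a small hypothetical frame (`df_demo`) in place of `df_train` and assumes the `income` column contains only the two class strings listed earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_train, using the two documented class strings
df_demo = pd.DataFrame({'income': ['<=50K', '>50K', '<=50K', '<=50K', '>50K']})

# True where income is '>50K', then cast to 0/1 integers
y_demo = (df_demo['income'] == '>50K').astype(int).to_numpy()
print(y_demo)
```

Any value other than `'>50K'` maps to 0 here, which matches the loop only as long as no unexpected strings appear in the column.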


### Test set

In [59]:
test_file = './datasets/adult_test.csv'

In [60]:
# load test set
df_test = pd.read_csv(test_file)
df_test.head()

Unnamed: 0,age,workclass,education,marital-status,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,33,Private,HS-grad,Never-married,Not-in-family,White,Male,3325,0,50,United-States
1,58,Private,HS-grad,Married-civ-spouse,Husband,White,Male,0,0,40,United-States
2,30,Self-emp-not-inc,HS-grad,Married-civ-spouse,Husband,White,Male,0,0,60,United-States
3,26,Private,Some-college,Never-married,Not-in-family,White,Female,0,0,20,United-States
4,43,State-gov,HS-grad,Never-married,Not-in-family,White,Male,0,0,60,United-States


In [61]:
df_test.describe()

Unnamed: 0,age,capital-gain,capital-loss,hours-per-week
count,10402.0,10402.0,10402.0,10402.0
mean,39.811575,1280.969237,106.101038,42.749567
std,12.063746,7826.438595,438.826968,11.200949
min,18.0,0.0,0.0,1.0
25%,30.0,0.0,0.0,40.0
50%,38.0,0.0,0.0,40.0
75%,48.0,0.0,0.0,48.0
max,90.0,99999.0,3683.0,99.0


In [62]:
X_test = df_test.copy()

You will need to prepare both training and test datasets before working with a Machine Learning method.

Note that you will need to apply an encoding method to the categorical data.

You are free to use any other pre-processing ideas.

In [82]:
#Include your code here
# Going with one-hot-encoding
categorical_features = ['workclass', 'education', 'marital-status', 'relationship', 'race', 'sex', 'native-country']
X_train_encoded = pd.get_dummies(X_train, columns=categorical_features).astype(int)
# X_train_encoded
X_test_encoded = pd.get_dummies(X_test, columns=categorical_features).astype(int)
# X_test_encoded
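Encoding the two frames separately only yields matching columns when every category appears in both sets, which appears to hold for this data. A minimal sketch of a safeguard, using hypothetical toy frames, is to reindex the encoded test columns against the encoded training columns:

```python
import pandas as pd

# Toy frames: the test set is missing two of the 'race' categories
train_demo = pd.DataFrame({'age': [28, 34, 46],
                           'race': ['White', 'Black', 'Asian-Pac-Islander']})
test_demo = pd.DataFrame({'age': [33, 58], 'race': ['White', 'White']})

train_enc = pd.get_dummies(train_demo, columns=['race']).astype(int)
test_enc = pd.get_dummies(test_demo, columns=['race']).astype(int)

# Add any column missing from the test set (filled with 0)
# and enforce the training column order
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_enc)
```

This matters for the grammar later on, since features are addressed by position (`x[4]`, `x[5]`, ...), so both sets need the same column order.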

In [83]:
display(X_train_encoded.head(10))
display(X_test_encoded.head(10))

Unnamed: 0,age,capital-gain,capital-loss,hours-per-week,workclass_Local-gov,workclass_Private,workclass_Self-emp-not-inc,workclass_State-gov,education_Bachelors,education_Doctorate,...,relationship_Not-in-family,relationship_Other-relative,relationship_Wife,race_Asian-Pac-Islander,race_Black,race_White,sex_Female,sex_Male,native-country_Others,native-country_United-States
0,28,0,0,40,0,1,0,0,1,0,...,1,0,0,0,0,1,0,1,0,1
1,34,0,1887,48,0,0,1,0,1,0,...,0,0,0,0,1,0,0,1,0,1
2,32,0,0,40,0,1,0,0,1,0,...,1,0,0,0,1,0,1,0,0,1
3,46,0,0,40,0,1,0,0,1,0,...,1,0,0,0,0,1,0,1,1,0
4,44,0,0,50,0,1,0,0,1,0,...,0,0,0,0,0,1,0,1,0,1
5,41,0,0,60,0,1,0,0,0,0,...,0,0,0,0,0,1,0,1,0,1
6,28,0,0,40,0,1,0,0,0,0,...,1,0,0,0,1,0,0,1,0,1
7,23,0,1590,48,0,1,0,0,0,0,...,1,0,0,0,0,1,0,1,0,1
8,52,0,0,40,0,1,0,0,0,0,...,1,0,0,0,0,1,1,0,0,1
9,33,0,0,44,0,1,0,0,0,0,...,0,0,0,0,0,1,0,1,0,1


Unnamed: 0,age,capital-gain,capital-loss,hours-per-week,workclass_Local-gov,workclass_Private,workclass_Self-emp-not-inc,workclass_State-gov,education_Bachelors,education_Doctorate,...,relationship_Not-in-family,relationship_Other-relative,relationship_Wife,race_Asian-Pac-Islander,race_Black,race_White,sex_Female,sex_Male,native-country_Others,native-country_United-States
0,33,3325,0,50,0,1,0,0,0,0,...,1,0,0,0,0,1,0,1,0,1
1,58,0,0,40,0,1,0,0,0,0,...,0,0,0,0,0,1,0,1,0,1
2,30,0,0,60,0,0,1,0,0,0,...,0,0,0,0,0,1,0,1,0,1
3,26,0,0,20,0,1,0,0,0,0,...,1,0,0,0,0,1,1,0,0,1
4,43,0,0,60,0,0,0,1,0,0,...,1,0,0,0,0,1,0,1,0,1
5,36,0,0,50,0,1,0,0,0,0,...,1,0,0,0,0,1,0,1,0,1
6,52,0,0,22,0,0,1,0,0,0,...,0,0,0,0,0,1,0,1,0,1
7,37,0,0,45,0,1,0,0,0,0,...,0,0,0,0,0,1,0,1,0,1
8,36,0,0,40,0,1,0,0,0,0,...,0,0,0,0,0,1,0,1,0,1
9,40,0,0,60,0,1,0,0,0,1,...,0,0,0,0,0,1,0,1,0,1


In [64]:
print(X_train_encoded.columns)

Index(['age', 'capital-gain', 'capital-loss', 'hours-per-week',
       'workclass_Local-gov', 'workclass_Private',
       'workclass_Self-emp-not-inc', 'workclass_State-gov',
       'education_Bachelors', 'education_Doctorate', 'education_HS-grad',
       'education_Masters', 'education_Some-college',
       'marital-status_Divorced', 'marital-status_Married-civ-spouse',
       'marital-status_Never-married', 'relationship_Husband',
       'relationship_Not-in-family', 'relationship_Other-relative',
       'relationship_Wife', 'race_Asian-Pac-Islander', 'race_Black',
       'race_White', 'sex_Female', 'sex_Male', 'native-country_Others',
       'native-country_United-States'],
      dtype='object')


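We can see that some of the entries have 0 for both their capital gain and capital loss features. Before checking how many rows this affects, we first compute the minimum and maximum of each numerical feature over the combined training and test data and use them for min-max scaling.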

In [84]:
X_total = pd.concat([X_train_encoded, X_test_encoded], ignore_index=True)
# X_total = X_total[X_total['age'] >= 17]
numerical_features = ['age', 'capital-gain', 'capital-loss', 'hours-per-week']
total_mins = {}
total_maxs = {}
for feature in numerical_features:
    total_mins[feature] = X_total.loc[:, feature].min()
    total_maxs[feature] = X_total.loc[:, feature].max()
print(total_mins, total_maxs)

{'age': 17, 'capital-gain': 0, 'capital-loss': 0, 'hours-per-week': 1} {'age': 90, 'capital-gain': 99999, 'capital-loss': 3683, 'hours-per-week': 99}


In [85]:
for feature in numerical_features:
    X_train_encoded.loc[:, feature] = (X_train.loc[:, feature] - total_mins[feature]) / (total_maxs[feature] - total_mins[feature])
    X_test_encoded.loc[:, feature] = (X_test.loc[:, feature] - total_mins[feature]) / (total_maxs[feature] - total_mins[feature])
display(X_train_encoded.head())
display(X_test_encoded.head())

Unnamed: 0,age,capital-gain,capital-loss,hours-per-week,workclass_Local-gov,workclass_Private,workclass_Self-emp-not-inc,workclass_State-gov,education_Bachelors,education_Doctorate,...,relationship_Not-in-family,relationship_Other-relative,relationship_Wife,race_Asian-Pac-Islander,race_Black,race_White,sex_Female,sex_Male,native-country_Others,native-country_United-States
0,0.150685,0.0,0.0,0.397959,0,1,0,0,1,0,...,1,0,0,0,0,1,0,1,0,1
1,0.232877,0.0,0.512354,0.479592,0,0,1,0,1,0,...,0,0,0,0,1,0,0,1,0,1
2,0.205479,0.0,0.0,0.397959,0,1,0,0,1,0,...,1,0,0,0,1,0,1,0,0,1
3,0.39726,0.0,0.0,0.397959,0,1,0,0,1,0,...,1,0,0,0,0,1,0,1,1,0
4,0.369863,0.0,0.0,0.5,0,1,0,0,1,0,...,0,0,0,0,0,1,0,1,0,1


Unnamed: 0,age,capital-gain,capital-loss,hours-per-week,workclass_Local-gov,workclass_Private,workclass_Self-emp-not-inc,workclass_State-gov,education_Bachelors,education_Doctorate,...,relationship_Not-in-family,relationship_Other-relative,relationship_Wife,race_Asian-Pac-Islander,race_Black,race_White,sex_Female,sex_Male,native-country_Others,native-country_United-States
0,0.219178,0.03325,0.0,0.5,0,1,0,0,0,0,...,1,0,0,0,0,1,0,1,0,1
1,0.561644,0.0,0.0,0.397959,0,1,0,0,0,0,...,0,0,0,0,0,1,0,1,0,1
2,0.178082,0.0,0.0,0.602041,0,0,1,0,0,0,...,0,0,0,0,0,1,0,1,0,1
3,0.123288,0.0,0.0,0.193878,0,1,0,0,0,0,...,1,0,0,0,0,1,1,0,0,1
4,0.356164,0.0,0.0,0.602041,0,0,0,1,0,0,...,1,0,0,0,0,1,0,1,0,1
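The scaling above maps each numerical feature to [0, 1] via (x - min) / (max - min), with min and max taken over the combined training and test rows. Because the encoded columns were cast to `int` earlier, the sketch below casts to `float` before assigning, so it does not rely on implicit dtype conversion. The toy frames and the helper name `min_max_scale` are hypothetical.

```python
import pandas as pd

def min_max_scale(train, test, cols):
    """Scale `cols` to [0, 1] using the min and max of the combined frames."""
    combined = pd.concat([train[cols], test[cols]], ignore_index=True)
    lo, hi = combined.min(), combined.max()
    train, test = train.copy(), test.copy()
    # Cast to float first so scaled values are not written into integer columns
    train[cols] = (train[cols].astype(float) - lo) / (hi - lo)
    test[cols] = (test[cols].astype(float) - lo) / (hi - lo)
    return train, test

train_demo = pd.DataFrame({'age': [17, 28], 'hours-per-week': [1, 40]})
test_demo = pd.DataFrame({'age': [90, 33], 'hours-per-week': [99, 50]})
train_s, test_s = min_max_scale(train_demo, test_demo, ['age', 'hours-per-week'])
print(train_s)
print(test_s)
```

With the minima and maxima printed earlier (age 17 to 90, hours-per-week 1 to 99), an age of 28 becomes 11/73, which agrees with the 0.150685 shown in the first training row.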


In [None]:
capitalData_zero_rows_train = X_train_encoded[(X_train_encoded['capital-gain'] == 0) & (X_train_encoded['capital-loss'] == 0)]
display(capitalData_zero_rows_train.head(10))
print(f'Number of rows that have no hard data about the capital gain and loss: {capitalData_zero_rows_train.shape[0]}')

In [None]:
capitalData_zero_rows_test = X_test_encoded[(X_test_encoded['capital-gain'] == 0) & (X_test_encoded['capital-loss'] == 0)]
display(capitalData_zero_rows_test.head(10))
print(f'Number of rows that have no hard data about the capital gain and loss: {capitalData_zero_rows_test.shape[0]}')

In [17]:
# X_train_encoded['capital-gain-zero'] = (X_train_encoded['capital-gain'] == 0).astype(int)
# X_train_encoded['capital-loss-zero'] = (X_train_encoded['capital-loss'] == 0).astype(int)

# Drop the original features if needed
X_train_encoded = X_train_encoded.drop(['capital-gain', 'capital-loss'], axis=1)
X_train_encoded.head()

Unnamed: 0,age,hours-per-week,workclass_Local-gov,workclass_Private,workclass_Self-emp-not-inc,workclass_State-gov,education_Bachelors,education_Doctorate,education_HS-grad,education_Masters,...,relationship_Not-in-family,relationship_Other-relative,relationship_Wife,race_Asian-Pac-Islander,race_Black,race_White,sex_Female,sex_Male,native-country_Others,native-country_United-States
0,28,40,0,1,0,0,1,0,0,0,...,1,0,0,0,0,1,0,1,0,1
1,34,48,0,0,1,0,1,0,0,0,...,0,0,0,0,1,0,0,1,0,1
2,32,40,0,1,0,0,1,0,0,0,...,1,0,0,0,1,0,1,0,0,1
3,46,40,0,1,0,0,1,0,0,0,...,1,0,0,0,0,1,0,1,1,0
4,44,50,0,1,0,0,1,0,0,0,...,0,0,0,0,0,1,0,1,0,1


In [18]:
# X_test_encoded['capital-gain-zero'] = (X_test_encoded['capital-gain'] == 0).astype(int)
# X_test_encoded['capital-loss-zero'] = (X_test_encoded['capital-loss'] == 0).astype(int)
# 
# Drop the original features if needed
X_test_encoded = X_test_encoded.drop(['capital-gain', 'capital-loss'], axis=1)
X_test_encoded.head()

Unnamed: 0,age,hours-per-week,workclass_Local-gov,workclass_Private,workclass_Self-emp-not-inc,workclass_State-gov,education_Bachelors,education_Doctorate,education_HS-grad,education_Masters,...,relationship_Not-in-family,relationship_Other-relative,relationship_Wife,race_Asian-Pac-Islander,race_Black,race_White,sex_Female,sex_Male,native-country_Others,native-country_United-States
0,33,50,0,1,0,0,0,0,1,0,...,1,0,0,0,0,1,0,1,0,1
1,58,40,0,1,0,0,0,0,1,0,...,0,0,0,0,0,1,0,1,0,1
2,30,60,0,0,1,0,0,0,1,0,...,0,0,0,0,0,1,0,1,0,1
3,26,20,0,1,0,0,0,0,0,0,...,1,0,0,0,0,1,1,0,0,1
4,43,60,0,0,0,1,0,0,1,0,...,1,0,0,0,0,1,0,1,0,1


Since 4,398 of the 5,200 rows in the training set and 8,789 of the 10,402 rows in the test set have zeros for both capital features, let us try to fill in these values. We will also engineer a new feature, net capital, by subtracting the capital loss from the capital gain.
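The counts quoted above correspond to roughly 85% of the rows in each set (4,398/5,200 ≈ 0.846 and 8,789/10,402 ≈ 0.845). A short sketch of computing such a share directly, on a hypothetical toy frame:

```python
import pandas as pd

# Toy values only; the real frames have thousands of rows
demo = pd.DataFrame({'capital-gain': [0, 0, 3325, 0],
                     'capital-loss': [0, 1887, 0, 0]})

# A row counts as "zero" only when both capital columns are 0
both_zero = (demo[['capital-gain', 'capital-loss']] == 0).all(axis=1)
print(f'{both_zero.sum()} of {len(demo)} rows ({both_zero.mean():.1%}) '
      'have neither a capital gain nor a capital loss')
```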

In [None]:
X_train_encoded['capital-gain'] = X_train_encoded['capital-gain'].replace(0, np.nan) # replacing 0s with NaN in order to use fillna()
X_train_encoded['capital-gain'] = X_train_encoded['capital-gain'].fillna(X_train_encoded['capital-gain'].median())
X_train_encoded['capital-loss'] = X_train_encoded['capital-loss'].replace(0, np.nan) # replacing 0s with NaN in order to use fillna()
X_train_encoded['capital-loss'] = X_train_encoded['capital-loss'].fillna(X_train_encoded['capital-loss'].median())

Next, we try KNN imputation, since simple imputation with the mean or median did not result in significant improvements.

In [None]:
from sklearn.impute import KNNImputer
imputer_gain = KNNImputer(n_neighbors=5, metric='nan_euclidean')
imputer_loss = KNNImputer(n_neighbors=5, metric='nan_euclidean')

X_train_encoded['capital-gain'] = X_train_encoded['capital-gain'].replace(0, np.nan)
X_train_encoded['capital-gain'] = imputer_gain.fit_transform(X_train_encoded[['capital-gain']])
X_test_encoded['capital-gain'] = X_test_encoded['capital-gain'].replace(0, np.nan)
X_test_encoded['capital-gain'] = imputer_gain.transform(X_test_encoded[['capital-gain']])

X_train_encoded['capital-loss'] = X_train_encoded['capital-loss'].replace(0, np.nan)
X_train_encoded['capital-loss'] = imputer_loss.fit_transform(X_train_encoded[['capital-loss']])
X_test_encoded['capital-loss'] = X_test_encoded['capital-loss'].replace(0, np.nan)
X_test_encoded['capital-loss'] = imputer_loss.transform(X_test_encoded[['capital-loss']])
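One caveat with the cell above, going by scikit-learn's documented behaviour: `KNNImputer` measures distances on the other features of each row, so when it is fitted on a single column no distance is defined and it falls back to the column average. A minimal sketch on hypothetical toy data, where fitting on two columns lets neighbours be chosen by the first one:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Columns: a fully observed feature and a feature with one missing value (toy data)
X_demo = np.array([[1.0, 10.0],
                   [2.0, np.nan],
                   [3.0, 30.0],
                   [4.0, 40.0]])

# Two columns: the two rows closest in the first feature (rows 0 and 2) supply the value
two_col = KNNImputer(n_neighbors=2).fit_transform(X_demo)

# One column: no other feature to measure distance on, so the column average is used
one_col = KNNImputer(n_neighbors=2).fit_transform(X_demo[:, [1]])

print(two_col[1, 1], one_col[1, 0])
```

If neighbour-based values are wanted for the capital features, passing several columns to the imputer (for example the capital column together with age and hours-per-week) is one way to obtain them; whether that helps the final score is not established here.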

In [None]:
X_train_encoded['net-capital'] = X_train_encoded['capital-gain'] - X_train_encoded['capital-loss']
X_train_encoded.head(10)

In [None]:
X_test_encoded['capital-gain'] = X_test_encoded['capital-gain'].replace(0, np.nan) # replacing 0s with NaN in order to use fillna()
X_test_encoded['capital-gain'] = X_test_encoded['capital-gain'].fillna(X_test_encoded['capital-gain'].median())
X_test_encoded['capital-loss'] = X_test_encoded['capital-loss'].replace(0, np.nan) # replacing 0s with NaN in order to use fillna()
X_test_encoded['capital-loss'] = X_test_encoded['capital-loss'].fillna(X_test_encoded['capital-loss'].median())

In [None]:
X_test_encoded['net-capital'] = X_test_encoded['capital-gain'] - X_test_encoded['capital-loss']
X_test_encoded.head(10)

Trying first by replacing the zero values under net-capital with its mean.

Doing the same for testing data

In [19]:
from sklearn.preprocessing import StandardScaler
non_bool_cols = ['age', 'hours-per-week']

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_encoded[non_bool_cols])
X_test_scaled = scaler.transform(X_test_encoded[non_bool_cols])
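`StandardScaler` standardises each column as z = (x - mean) / std, with the mean and the population standard deviation (`ddof=0`) estimated on the training data and then reused for the test data. A NumPy sketch of the same computation on hypothetical toy values:

```python
import numpy as np

train_age = np.array([28.0, 34.0, 32.0, 46.0, 44.0])  # toy training values
test_age = np.array([33.0, 58.0])                      # toy test values

# Statistics come from the training data only
mean = train_age.mean()
std = train_age.std(ddof=0)

train_z = (train_age - mean) / std
test_z = (test_age - mean) / std   # reuse training statistics; no refit on test data
print(train_z.round(3), test_z.round(3))
```

Reusing the training statistics keeps both sets on the same scale, which is why the cell above calls `fit_transform` on the training data and only `transform` on the test data.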

Updating the data frame with the scaled data of non-boolean features

In [20]:
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=non_bool_cols)
X_train_encoded.update(X_train_scaled_df)
X_train_encoded.head()

Unnamed: 0,age,hours-per-week,workclass_Local-gov,workclass_Private,workclass_Self-emp-not-inc,workclass_State-gov,education_Bachelors,education_Doctorate,education_HS-grad,education_Masters,...,relationship_Not-in-family,relationship_Other-relative,relationship_Wife,race_Asian-Pac-Islander,race_Black,race_White,sex_Female,sex_Male,native-country_Others,native-country_United-States
0,-0.976267,-0.25479,0,1,0,0,1,0,0,0,...,1,0,0,0,0,1,0,1,0,1
1,-0.475107,0.476699,0,0,1,0,1,0,0,0,...,0,0,0,0,1,0,0,1,0,1
2,-0.64216,-0.25479,0,1,0,0,1,0,0,0,...,1,0,0,0,1,0,1,0,0,1
3,0.527214,-0.25479,0,1,0,0,1,0,0,0,...,1,0,0,0,0,1,0,1,1,0
4,0.360161,0.659571,0,1,0,0,1,0,0,0,...,0,0,0,0,0,1,0,1,0,1


Doing the same for the test data

In [21]:
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=non_bool_cols)
X_test_encoded.update(X_test_scaled_df)
X_test_encoded.head()

Unnamed: 0,age,hours-per-week,workclass_Local-gov,workclass_Private,workclass_Self-emp-not-inc,workclass_State-gov,education_Bachelors,education_Doctorate,education_HS-grad,education_Masters,...,relationship_Not-in-family,relationship_Other-relative,relationship_Wife,race_Asian-Pac-Islander,race_Black,race_White,sex_Female,sex_Male,native-country_Others,native-country_United-States
0,-0.558633,0.659571,0,1,0,0,0,0,1,0,...,1,0,0,0,0,1,0,1,0,1
1,1.529535,-0.25479,0,1,0,0,0,0,1,0,...,0,0,0,0,0,1,0,1,0,1
2,-0.809214,1.573933,0,0,1,0,0,0,1,0,...,0,0,0,0,0,1,0,1,0,1
3,-1.143321,-2.083514,0,1,0,0,0,0,0,0,...,1,0,0,0,0,1,1,0,0,1
4,0.276634,1.573933,0,0,0,1,0,0,1,0,...,1,0,0,0,0,1,0,1,0,1


Convert the datasets to NumPy arrays so they are easier to work with.

In [86]:
# data features
X_train = X_train_encoded.to_numpy()
X_test = X_test_encoded.to_numpy()

## GRAPE

<div>
<img src="https://drive.google.com/uc?export=view&id=1hw43Oi3lGTCkspQ0ged2bZB8q2EpcPhz" width="150"/>
</div>

GRammatical Algorithms in Python for Evolution (GRAPE)


In [23]:
!pip install deap

import grape
import algorithms

from os import path
from deap import creator, base, tools
import random
import csv

Defaulting to user installation because normal site-packages is not writeable


You can import functions to be used with your grammar from [functions.py](https://github.com/UL-BDS/grape/blob/main/functions.py) in the GRAPE repository, and/or you can define your own functions.

In [108]:
from functions import add, sub, mul, pdiv, psqrt, plog, and_, or_, nand_, nor_, not_, if_, less_than_or_equal, greater_than_or_equal

'heartDisease.bnf' is a grammar used for another problem just to check if everything is working well.

Write your own grammar in a text file and save it in your Drive account.

Assign the full path of your grammar to GRAMMAR_FILE and print the file to check it.

In [152]:
GRAMMAR_FILE = '../census_income_grammar_new.bnf'

# Print the grammar file to check its contents
with open(GRAMMAR_FILE, "r") as f:
    print(f.read())


<expr> ::= <expr> | if_(<conditional_branches>,<expr>,<expr>) | <log_op> | <num_op> | <feature>

<log_op> ::= and_(<log_op>,<log_op>) | or_(<log_op>,<log_op>) | not_(<log_op>) | <boolean_feature>

<num_op> ::= add(<num_op>,<num_op>) | sub(<num_op>,<num_op>) | mul(<num_op>,<num_op>) | pdiv(<num_op>,<num_op>) | <nonboolean_feature>

<conditional_branches> ::= less_than_or_equal(<expr>,<expr>) | greater_than_or_equal(<expr>,<expr>)

<feature> ::= <boolean_feature> | <nonboolean_feature>

<boolean_feature> ::=  x[4]|x[5]|x[6]|x[7]|x[8]|x[9]|x[10]|x[11]|x[12]|x[13]|x[14]|x[15]|x[16]|x[17]|x[18]|x[19]|x[20]|x[21]|x[22]|x[23]|x[24]|x[25]|x[26]

<nonboolean_feature> ::= x[0]|x[1]|x[2]|x[3]|<c><c>.<c><c>

<c>  ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
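To make the grammar more concrete, the sketch below evaluates one expression that these rules can derive. The three functions are simplified NumPy stand-ins written for this example, not the GRAPE implementations from functions.py, and the feature matrix is a made-up 5 x 3 array with features in rows and samples in columns.

```python
import numpy as np

# Simplified stand-ins (assumptions), named after the functions used in the grammar
def less_than_or_equal(a, b):
    return np.less_equal(a, b)

def not_(a):
    return np.logical_not(a)

def if_(cond, a, b):
    return np.where(cond, a, b)

# Rows 0-3 play the role of numerical features, row 4 of a boolean feature
x = np.array([[0.2, 0.8, 0.5],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0],
              [1.0, 1.0, 0.0]])

# <expr> -> if_(<conditional_branches>,<expr>,<expr>), with a <c><c>.<c><c> constant
phenotype = "if_(less_than_or_equal(x[0],00.50),x[4],not_(x[4]))"
pred = eval(phenotype)
print(pred)
```

Each `x[i]` is a whole row, so a single `eval` call produces one prediction per sample. Constants such as `00.50` are valid Python float literals despite the leading zero.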


Run the following cell to put your grammar on the class Grammar.

In [153]:
BNF_GRAMMAR = grape.Grammar(GRAMMAR_FILE)

The fitness function here is the fraction of outputs wrongly predicted (the classification error, between 0 and 1), so lower values are better.

You can write your own fitness function if you prefer.

In [154]:
def fitness_eval(individual, points):
    """
    Fitness function: fraction of samples classified incorrectly.
    """
    # `x` is referenced by name inside the evolved expression
    x = points[0]
    Y = points[1]

    # Invalid individuals have no usable phenotype
    if individual.invalid:
        return np.nan,

    # Evaluate the expression
    try:
        pred = eval(individual.phenotype)
    except (FloatingPointError, ZeroDivisionError, OverflowError,
            MemoryError):
        return np.nan,
    assert np.isrealobj(pred)

    # Classification error = 1 - accuracy
    compare = np.equal(Y, pred)
    fitness = 1 - np.mean(compare)

    return fitness,
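As a quick illustration of the error measure used above, the following sketch computes it for a hypothetical prediction vector; both arrays are made up for the example.

```python
import numpy as np

y_true = np.array([0, 1, 0, 0, 1])
y_hat = np.array([0, 1, 1, 0, 0])   # hypothetical predictions: 3 of 5 correct

# Fraction of mismatches, i.e. 1 - accuracy
fitness = 1 - np.mean(np.equal(y_true, y_hat))
print(fitness)
```

Because the comparison is an exact equality, an expression that outputs values other than 0 or 1 (for instance a raw numerical feature) is counted as wrong on every such sample.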

To use the fitness function above properly with GRAPE, the features must be in the rows and the samples in the columns, so if your data is not arranged like that, you need to transpose the matrix.

Check the printed shapes. If you run the transpose cell twice, the matrices will be transposed back and the evaluation will not work properly.

In [138]:
X_train = np.transpose(X_train)
X_test = np.transpose(X_test)

In [139]:
print('Training (X,Y):\t', X_train.shape, y_train.shape)
print('Test (X):\t', X_test.shape)

Training (X,Y):	 (27, 5200) (5200,)
Test (X):	 (27, 10402)
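Since running the transpose cell twice silently flips the matrices back, one option is a small guard that transposes only when the samples are still on the first axis. The helper name `features_first` is hypothetical, and the sketch assumes the number of features differs from the number of samples.

```python
import numpy as np

def features_first(X, n_samples):
    """Return X with shape (n_features, n_samples), transposing only if needed."""
    if X.shape[1] == n_samples:
        return X                      # already features x samples
    if X.shape[0] == n_samples:
        return X.T                    # samples x features: transpose once
    raise ValueError(f'Neither axis of shape {X.shape} matches {n_samples} samples')

X_demo = np.zeros((5, 3))             # 5 samples, 3 features
once = features_first(X_demo, n_samples=5)
twice = features_first(once, n_samples=5)   # calling it again is harmless
print(once.shape, twice.shape)
```

For the training data, `n_samples` would be `len(y_train)`; for the test data, the number of rows in the test file.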


Set the Grammatical Evolution parameters.

Make sure you set a random seed just in case we need to re-run your experiments.

In [155]:
POPULATION_SIZE = 1000
MAX_GENERATIONS = 200
P_CROSSOVER = 0.76
P_MUTATION = 0.026
ELITE_SIZE = 3
HALL_OF_FAME_SIZE = 3

TOURNAMENT_SIZE = 6
RANDOM_SEED = 42
random.seed(RANDOM_SEED)

CODON_CONSUMPTION = 'lazy'
GENOME_REPRESENTATION = 'list'
MAX_GENOME_LENGTH = None

MAX_INIT_TREE_DEPTH = 13
MIN_INIT_TREE_DEPTH = 3
MAX_TREE_DEPTH = 35
MAX_WRAPS = 0
CODON_SIZE = 255

REPORT_ITEMS = ['gen', 'invalid', 'avg', 'std', 'min', 'max',
                'best_ind_length', 'avg_length',
                'best_ind_nodes', 'avg_nodes',
                'best_ind_depth', 'avg_depth',
                'avg_used_codons', 'best_ind_used_codons',
                'structural_diversity', 'fitness_diversity',
                'selection_time', 'generation_time']

Create a toolbox.

In [156]:
toolbox = base.Toolbox()

# define a single objective, minimising fitness strategy:
creator.create("FitnessMin", base.Fitness, weights=(-1.0,))

creator.create('Individual', grape.Individual, fitness=creator.FitnessMin)

toolbox.register("populationCreator", grape.sensible_initialisation, creator.Individual)

toolbox.register("evaluate", fitness_eval)

# Tournament selection:
toolbox.register("select", tools.selTournament, tournsize=TOURNAMENT_SIZE)

# Single-point crossover:
toolbox.register("mate", grape.crossover_onepoint)

# Flip-int mutation:
toolbox.register("mutate", grape.mutation_int_flip_per_codon)

In [157]:
# create initial population (generation 0):
population = toolbox.populationCreator(pop_size=POPULATION_SIZE,
                                           bnf_grammar=BNF_GRAMMAR,
                                           min_init_depth=MIN_INIT_TREE_DEPTH,
                                           max_init_depth=MAX_INIT_TREE_DEPTH,
                                           codon_size=CODON_SIZE,
                                           codon_consumption=CODON_CONSUMPTION,
                                           genome_representation=GENOME_REPRESENTATION
                                            )

# define the hall-of-fame object:
hof = tools.HallOfFame(HALL_OF_FAME_SIZE)

# prepare the statistics object:
stats = tools.Statistics(key=lambda ind: ind.fitness.values)
stats.register("avg", np.nanmean)
stats.register("std", np.nanstd)
stats.register("min", np.nanmin)
stats.register("max", np.nanmax)

IndexError: list index out of range

Run Grammatical Evolution.

In [133]:
population, logbook = algorithms.ge_eaSimpleWithElitism(population, toolbox, cxpb=P_CROSSOVER, mutpb=P_MUTATION,
                                              ngen=MAX_GENERATIONS, elite_size=ELITE_SIZE,
                                              bnf_grammar=BNF_GRAMMAR,
                                              codon_size=CODON_SIZE,
                                              max_tree_depth=MAX_TREE_DEPTH,
                                              max_genome_length=MAX_GENOME_LENGTH,
                                              points_train=[X_train, y_train],
                                              codon_consumption=CODON_CONSUMPTION,
                                              report_items=REPORT_ITEMS,
                                              genome_representation=GENOME_REPRESENTATION,
                                              stats=stats, halloffame=hof, verbose=False)

SyntaxError: invalid syntax (<string>, line 0)

Show the best individual as an expression.

In [None]:
# Best individual
import textwrap
best = hof.items[0].phenotype
print("Best individual: \n","\n".join(textwrap.wrap(best,80)))
print("\nTraining Fitness: ", hof.items[0].fitness.values[0])
print("Depth: ", hof.items[0].depth)
print("Length of the genome: ", len(hof.items[0].genome))
print(f'Used portion of the genome: {hof.items[0].used_codons/len(hof.items[0].genome):.2f}')

Define a function to predict values, without comparing to expected outputs.

In [None]:
def predict(individual, X):
    # `x` is referenced by name inside the evolved expression
    x = X

    # Invalid individuals have no usable phenotype
    if individual.invalid:
        return np.nan,

    # Evaluate the expression
    try:
        pred = eval(individual.phenotype)
    except (FloatingPointError, ZeroDivisionError, OverflowError,
            MemoryError):
        return np.nan,
    assert np.isrealobj(pred)

    return pred

Predict the classes of the test set.

Make sure the predictions printed here, in the notebook you submit to Brightspace, are the same ones you used in your best submission to the Kaggle competition.

In [None]:
y_pred = predict(hof.items[0], X_test)
print("Predicted classes of the test set: ", y_pred)

In [None]:
# Build the submission frame; the index runs from 0 to len(y_pred) - 1
results = pd.DataFrame({'index': range(len(y_pred)), 'income': y_pred}).astype(int)
display(results)

In [None]:
# Convert the predictions to a csv file
results.to_csv('../predictions.csv', index=False)

Write code to create a .csv file with the following format:
1. The first column is the index (from 0 to 10401);
2. The second column is named `income` and contains the predictions (only 0s or 1s) you got in the previous cell with y_pred.

Example:

    index,income

    0,0

    1,0

    2,1

    ...

    10401,0
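Before submitting, a quick sanity check of the format described above can catch mistakes early. This is a sketch on a hypothetical in-memory frame; the helper name `check_submission` is an assumption and not part of the competition tooling.

```python
import pandas as pd

def check_submission(df, n_rows):
    """Raise AssertionError if df does not match the expected submission format."""
    assert list(df.columns) == ['index', 'income'], 'columns must be index,income'
    assert df['index'].tolist() == list(range(n_rows)), 'index must run from 0 to n_rows - 1'
    assert df['income'].isin([0, 1]).all(), 'income must contain only 0 or 1'

# Toy frame with 4 rows standing in for the real predictions
demo = pd.DataFrame({'index': range(4), 'income': [0, 0, 1, 0]})
check_submission(demo, n_rows=4)
print(demo.to_csv(index=False))
```

For the real file, the same check would be called with `n_rows=10402` on the frame that is written to the .csv.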


Submit it to the competition and check your score there.