## Optimize parameters with genetic algorithm

#### Genetic Algorithm, so what is it? 

Genetic algorithm is an optimization method with the fundamental idea of it based on Darwin's Theory of Evolution. The process follows the steps below: 

1. *Initialize Population*: Create an initial population aka your model parameters
2. *Calculate fitness*: Using a fitness function ie) ROC, Maximum, Custom function
3. *Crossover*: Combines parents to form next generation
4. *Mutation*: Apply random changes to individual parents to form children
5. *Select survivor*: If criteria is not met go to step 2, else go to next step 
6. *Terminate and return best*

### Benchmark Model

#### Import the libraries
Import some libraries that will be used along with model and data. 

In [27]:
import numpy as np
import pandas as pd
import random
import matplotlib.pyplot
import os
%matplotlib inline 

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

#### Build benchmark model
Building a logistic regression model using [Bank Marketing Data Set](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing)

In [28]:
import os
df = pd.read_csv(os.getcwd()+'/bank/bank.csv', sep=';')
df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,79,1,-1,0,unknown,no
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure,no
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,185,1,330,1,failure,no
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,199,4,-1,0,unknown,no
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,226,1,-1,0,unknown,no


Since the data has missed data types we'll go ahead and run one hot encoding to make all data numeric. 

In [29]:
df = pd.get_dummies(data=df, columns=df.loc[:, df.dtypes == object].columns.drop('y'))
df.y = df.y.replace(to_replace=['no', 'yes'], value=[0, 1])
df.head()

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous,y,job_admin.,job_blue-collar,...,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_failure,poutcome_other,poutcome_success,poutcome_unknown
0,30,1787,19,79,1,-1,0,0,0,0,...,0,0,0,0,1,0,0,0,0,1
1,33,4789,11,220,1,339,4,0,0,0,...,0,0,1,0,0,0,1,0,0,0
2,35,1350,16,185,1,330,1,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,30,1476,3,199,4,-1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
4,59,0,5,226,1,-1,0,0,0,1,...,0,0,1,0,0,0,0,0,0,1


In [34]:
X_train, X_test, y_train, y_test = train_test_split(df, 
                                                    df.y, test_size=0.30, 
                                                    random_state=101,
                                                    stratify=df.y)

Build a logistic regression model

In [36]:
logmodel = LogisticRegression(max_iter=1000)
logmodel.fit(X_train,y_train)
predictions = logmodel.predict(X_test)
print("Accuracy = "+ str(accuracy_score(y_test,predictions)))

Accuracy = 1.0


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


The accuracy of the model is already fairly high given the fact that the dataset is not something that is seen in real-life. Let's see if the power of evolution can push this model to perform even better. 

### Setting up the genetic algorithm

#### Initialize the population
We'll use binary in our case where size determines the length of our population(list), n_feat determines how many parameters we want to tune, and the ratio allowing us to control the ratio of n_feat to define as False(0). In the case of continous values, the ratio may not need to be used.

In [147]:
def initialization_of_population(size,n_feat,ratio):
    population = []
    for i in range(size):
        chromosome = np.ones(n_feat,dtype=np.bool)
        chromosome[:int(ratio*n_feat)]=False
        np.random.shuffle(chromosome)
        population.append(chromosome)
    return population

#### Define fitness score 
Create a fitness score function that takes the population as input and returns score(accuracy using the population) and the population which was used

In [148]:
def fitness_score(population):
    scores = []
    for chromosome in population:
        logmodel.fit(X_train.iloc[:,chromosome],y_train)
        predictions = logmodel.predict(X_test.iloc[:,chromosome])
        scores.append(accuracy_score(y_test,predictions))
    scores, population = np.array(scores), np.array(population) 
    inds = np.argsort(scores)
    return list(scores[inds][::-1]), list(population[inds,:][::-1])

#### Select top scoring population as parent
This is the first step of creating the next generation population by selecting n_parents(where n_parents <= size) to create the next population. Here we will select the top parents based on descending order of score (see fitness_score function for sorting).

In [149]:
def selection(population,n_parents):
    population_nextgen = []
    for i in range(n_parents):
        print(population_nextgen.append(population[i]))
    return population_nextgen

#### Crossover to create child from parent
The second step of ceating the next generation population is crossover. In this step we use the population selected using the selection function and for each population we perform a crossover. 

>Crossover is the Genetic Algorithm’s distinguishing feature. It involves mixing and matching parts of two parents to form children. How you do that mixing and matching depends on the representation of the individuals. — Page 36, [Essentials of Metaheuristics](https://www.amazon.com/Essentials-Metaheuristics-Sean-Luke/dp/0557148596/ref=as_li_ss_tl?dchild=1&keywords=Essentials+of+Metaheuristics&qid=1603664181&s=books&sr=1-2&linkCode=sl1&tag=inspiredalgor-20&linkId=2f827fb9c35f95c5d0be10b2693e9c42&language=en_US), 2011.

In [150]:
def crossover(pop_after_sel):
    population_nextgen=pop_after_sel
    for i in range(len(pop_after_sel)):
        child=pop_after_sel[i]
        child[3:7]=pop_after_sel[(i+1)%len(pop_after_sel)][3:7] # Arbitrary method as example
        population_nextgen.append(child)
    return population_nextgen

#### Mutation process

We also need a functin to perfom muation. This process will flip the bits based on random probability and the mutation rate

In [151]:
def mutation(pop_after_cross,mutation_rate):
    population_nextgen = []
    for i in range(0,len(pop_after_cross)):
        chromosome = pop_after_cross[i]
        for j in range(len(chromosome)):
            if random.random() < mutation_rate:
                chromosome[j]= not chromosome[j]
        population_nextgen.append(chromosome)
    return population_nextgen

#### Putting it all together
Now that we have all the steps built we can put it together and see if the accuracy actually improves. 

In [152]:
def generations(size,n_feat,ratio,n_parents,mutation_rate,n_gen,X_train,
                                   X_test, y_train, y_test):
    best_chromo= []
    best_score= []
    population_nextgen=initialization_of_population(size,n_feat,ratio)
    for i in range(n_gen):
        scores, pop_after_fit = fitness_score(population_nextgen)
        print(scores[:2])
        pop_after_sel = selection(pop_after_fit,n_parents)
        pop_after_cross = crossover(pop_after_sel)
        population_nextgen = mutation(pop_after_cross,mutation_rate)
        best_chromo.append(pop_after_fit[0])
        best_score.append(scores[0])
    return best_chromo,best_score

In [155]:
chromo,score=generations(size=200,n_feat=len(ohe_data.columns)+1,ratio=0.7,n_parents=100,mutation_rate=0.10,
                     n_gen=38,X_train=X_train,X_test=X_test,y_train=y_train,y_test=y_test)
logmodel.fit(X_train.iloc[:,chromo[-1]],y_train)
predictions = logmodel.predict(X_test.iloc[:,chromo[-1]])
print("Accuracy score after genetic algorithm is= "+str(accuracy_score(y_test,predictions)))

[0.9473684210526315, 0.9415204678362573]
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
[0.9473684210526315, 0.9473684210526315]
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
Non

[0.9473684210526315, 0.9473684210526315]
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
[0.9532163742690059, 0.9532163742690059]
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
Non

[0.9473684210526315, 0.9473684210526315]
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
[0.9473684210526315, 0.9473684210526315]
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
Non

Interesting results. My guess is that there is some data issues that need to be ironed out or the fact that the dataset is much cleaner than expected may be an issue as well 


In [156]:
logmodel.coef_

array([[-0.93227053,  0.48572209, -0.9952891 , -0.09512142, -0.62292499,
         2.17457489, -1.45761938, -0.0650736 , -0.04232251, -0.65717391,
        -1.95015527, -1.25125456]])