# Data Comparison
This notebook compares the performance of the genetic algorithm (GA) search with the best performance reported in the corresponding tutorial or paper. It also describes the datasets used to benchmark see-classify in terms of the number of items, features, and labels.

In [1]:
from figures_markdown_snippets import show_data_instructions

show_data_instructions()


To generate data, run:
- for sklearn
    ```bash
    sbatch generate_sklearn_data.sb -n 10 -p 10
    ```
- for dhahri
    ```bash
    sbatch generate_dhahri_data.sb -n 20 -p 100 -t 30
    ```

These commands generate files for running the genetic search: 10 generations with a population size of 10 for the sklearn tutorial, and 20 generations with a population size of 100 over 30 trials for the Breast Cancer Wisconsin (Diagnostic) Dataset.

The `-n`, `-p`, and `-t` flags set the number of generations, the population size, and the number of trials, respectively.

The generated data records the top 10 individuals and the full population at each generation. It is stored in the corresponding output file (i.e. the slurm_\[id\].out file).

To extract this data to a CSV file, run:
- for the top 10 individuals:
    ```bash
    grep "# GEN HOF_index" slurm_[id].out | cut -d '|' -f2 > "filename_1.csv"
    ```
- for the population:
    ```bash
    grep "# GEN population_index" slurm_[id].out | cut -d '|' -f2 > "filename_2.csv"
    ```

If several output files contain different trials of the same type of GA run, move all the relevant files into one directory, change to that directory, and run the following commands instead:
- for the top 10 individuals:
    ```bash
    grep "# GEN HOF_index" *.out | cut -d '|' -f2 > "filename_1.csv"
    ```
- for the population:
    ```bash
    grep "# GEN population_index" *.out | cut -d '|' -f2 > "filename_2.csv"
    ```
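
For readers who prefer to stay in Python, the `grep ... | cut -d '|' -f2` step can be approximated as in the sketch below. The function name `extract_marked_fields` and the sample log lines are hypothetical and only illustrate the idea; the real contents of the slurm output after the `|` are not shown in this notebook.

```python
def extract_marked_fields(lines, marker, delimiter="|", field=2):
    """Roughly mimic `grep marker | cut -d delimiter -f field` on an iterable of lines."""
    extracted = []
    for line in lines:
        line = line.rstrip("\n")
        if marker not in line:
            continue  # grep step: keep only lines containing the marker
        parts = line.split(delimiter)
        if len(parts) == 1:
            # Like cut, pass through lines that contain no delimiter unchanged.
            extracted.append(line)
        else:
            # cut fields are 1-indexed; a missing field yields an empty string.
            extracted.append(parts[field - 1] if field <= len(parts) else "")
    return extracted


# Hypothetical log lines, made up for illustration only.
sample_log = [
    "Starting trial 0",
    "# GEN HOF_index|1,0,0.05",
    "# GEN population_index|1,0,0.10",
    "# GEN HOF_index|1,1,0.075",
]

hof_rows = extract_marked_fields(sample_log, "# GEN HOF_index")
print(hof_rows)  # ['1,0,0.05', '1,1,0.075']
```

To process a real file, one could pass an open file handle as `lines` and write the returned rows to a CSV file.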


In [2]:
# Path hack so that we can import see library.
import sys, os
sys.path.insert(0, os.path.abspath('..'))

## Describing Datasets

In [3]:
from sklearn.datasets import make_moons, make_circles, make_classification
import numpy as np
from see.classifier_helpers.fetch_data import fetch_wisconsin_data

# Create datasets

datasets = []

# Circles
datasets.append(make_circles(noise=0.2, factor=0.5, random_state=1))

# Linearly Separable
X, y = make_classification(
    n_features=2,
    n_redundant=0,
    n_informative=2,
    random_state=1,
    n_clusters_per_class=1,
)
rng = np.random.RandomState(2)
X += 2 * rng.uniform(size=X.shape)
datasets.append((X, y))

# Moons
datasets.append(make_moons(noise=0.3, random_state=0))

# Wisconsin Breast Cancer Dataset used by Dhahri 2019
datasets.append(fetch_wisconsin_data())

ds_names = [
    "Circles (Tutorial)",
    "Linearly Separable (Tutorial)",
    "Moons (Tutorial)",
    "Breast Cancer Wisconsin Diagnostic (Used in Dhahri 2019)",
]

In [4]:
# Describe datasets

import pandas as pd
descriptions = pd.DataFrame(index=['# Items', '# Features', '# Classes'])
for ds, name in zip(datasets, ds_names):
    X, y = ds
    descriptions[name] = [len(X), len(X[0]), 2]

In [5]:
descriptions

| | Circles (Tutorial) | Linearly Separable (Tutorial) | Moons (Tutorial) | Breast Cancer Wisconsin Diagnostic (Used in Dhahri 2019) |
|---|---|---|---|---|
| # Items | 100 | 100 | 100 | 569 |
| # Features | 2 | 2 | 2 | 30 |
| # Classes | 2 | 2 | 2 | 2 |
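
The description cell hard-codes the number of classes as 2. As a minimal sketch, the count could instead be derived from the labels themselves. The helper name `describe_dataset` and the toy data are hypothetical; the approach assumes `X` is a sequence of rows and `y` is a flat sequence of labels (a list or a 1-D NumPy array).

```python
def describe_dataset(X, y):
    """Return (items, features, classes) for a feature matrix X and labels y."""
    n_items = len(X)
    n_features = len(X[0])
    n_classes = len(set(y))  # count distinct labels instead of hard-coding 2
    return n_items, n_features, n_classes


# Hypothetical toy data, made up for illustration only.
X_toy = [[0.1, 1.2], [0.4, 0.9], [1.5, 0.2], [1.1, 0.3]]
y_toy = [0, 0, 1, 1]

print(describe_dataset(X_toy, y_toy))  # (4, 2, 2)
```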


## Comparing search results to optimal results
We examine the top 10 solutions in the hall of fame at the final generation of each GA run and compare them with the best results reported in the corresponding tutorial or paper.

In [6]:
# Set file names

hof_datasets = [
    "../data/0802_sklearn_data/circles_hof_100_100.csv",
    "../data/0802_sklearn_data/linearly_separable_hof_100_100.csv",
    "../data/0802_sklearn_data/moons_hof_100_100.csv",
    "../data/0730_dhahri_simple_data/pop_size_100/dhahri_2019_hof_100_100.csv"
]

In [7]:
# Describe GA runs
num_gen = 100
pop_size = 100
hof_size = 10

In [8]:
import pandas as pd
from figures_helpers import extract_hof

# TODO: Round sig figs
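rows = []
for i, csv in enumerate(hof_datasets):
    df = extract_hof(csv)
    temp = df[df[0] == num_gen] # Extract Hall of fame members at final generation
    temp = temp[2].describe()
    temp.name = ds_names[i]
    rows.append(temp)

# DataFrame.append was removed in pandas 2.0, so build the frame from the list of Series.
# Sorting the columns keeps the alphabetical layout of the recorded output.
best_scores = pd.DataFrame(rows).sort_index(axis=1)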

# Add best
best_scores.insert(loc=0, column="Tutorial/Paper Best Found", value=[0.075, 0.050, 0.025, 0.0176])
best_scores.round(4)

| | Tutorial/Paper Best Found | 25% | 50% | 75% | count | max | mean | min | std |
|---|---|---|---|---|---|---|---|---|---|
| Circles (Tutorial) | 0.075 | 0.025 | 0.025 | 0.025 | 1000.0 | 0.025 | 0.025 | 0.025 | 0.0 |
| Linearly Separable (Tutorial) | 0.05 | 0.05 | 0.05 | 0.05 | 1000.0 | 0.1 | 0.0511 | 0.05 | 0.0051 |
| Moons (Tutorial) | 0.025 | 0.025 | 0.025 | 0.05 | 1000.0 | 0.125 | 0.0348 | 0.025 | 0.0134 |
| Breast Cancer Wisconsin Diagnostic (Used in Dhahri 2019) | 0.0176 | 0.0175 | 0.0175 | 0.0175 | 1000.0 | 0.0526 | 0.019 | 0.0175 | 0.0052 |
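
One way to read the table is to compare each dataset's reference value with the minimum score in the final hall of fame. The sketch below does this with values copied from the table. It assumes the scores are error rates, so lower is better; the notebook does not state the metric explicitly. The `compare` helper and the shortened dataset keys are hypothetical.

```python
import math

# (tutorial/paper best, GA hall-of-fame minimum), copied from the table above.
# Assumption: scores are error rates, so a lower value is better.
results = {
    "Circles": (0.075, 0.025),
    "Linearly Separable": (0.05, 0.05),
    "Moons": (0.025, 0.025),
    "Breast Cancer Wisconsin": (0.0176, 0.0175),
}


def compare(reference, ga_best, tol=1e-9):
    """Classify the GA result relative to the reference score."""
    if math.isclose(reference, ga_best, abs_tol=tol):
        return "matched"
    return "improved" if ga_best < reference else "worse"


summary = {name: compare(ref, ga) for name, (ref, ga) in results.items()}
for name, verdict in summary.items():
    print(f"{name}: {verdict}")
```

Under that assumption, the GA minimum is lower than the reference for Circles and matches it for Linearly Separable and Moons. The Breast Cancer Wisconsin values differ by only 0.0001, which may reflect rounding in the reported figure, so that comparison should be read with caution.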
