# Notes #
1. Data preparation
    + Prepare in such a way that the class balance is different from  
    the previous datasets
    + No missing values
    + What class split should I use?
        + 10/90 
2. Keras model
3. Scikit model
    - Note: check the Keras wrapper for cross-validation
4. Validation
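
A 10/90 class split like the one noted under data preparation could be produced by downsampling one class. The sketch below is only an illustration using the standard library: the helper name `downsample_to_ratio`, its parameters, and the toy rows are hypothetical and are not part of `automation_script`.

```python
import random
from collections import Counter


def downsample_to_ratio(rows, label_of, minority_label, minority_frac=0.10, seed=0):
    """Keep every majority row and only enough minority rows so that the
    minority class makes up roughly `minority_frac` of the result."""
    minority = [r for r in rows if label_of(r) == minority_label]
    majority = [r for r in rows if label_of(r) != minority_label]
    # minority / (minority + majority) = frac  =>  minority = majority * frac / (1 - frac)
    n_keep = round(len(majority) * minority_frac / (1 - minority_frac))
    n_keep = min(n_keep, len(minority))  # cannot keep more rows than exist
    rng = random.Random(seed)            # fixed seed so the subset is reproducible
    result = majority + rng.sample(minority, n_keep)
    rng.shuffle(result)
    return result


# Toy data (hypothetical): 900 rows of class 0 and 800 rows of class 1.
rows = [("a", 0)] * 900 + [("b", 1)] * 800
subset = downsample_to_ratio(rows, label_of=lambda r: r[1], minority_label=1)
counts = Counter(label for _, label in subset)
print(counts)
```

Downsampling discards data, so on a small dataset it may be worth comparing against keeping all rows and using class weights instead.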


**Rough notes**
1. Check the class distribution of the Iris dataset
2. Class distribution details
    + Iris - 50/50
    + Adult salary - 25/75
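
Class distribution figures like the ones above can be computed from a label column with a few lines. This is a minimal sketch on made-up labels; the function name `class_distribution` is hypothetical. With pandas, `data['rings'].value_counts(normalize=True)` gives the same proportions as fractions.

```python
from collections import Counter


def class_distribution(labels):
    """Return the share of each class as a percentage of all labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


# Toy labels (hypothetical) with a 25/75 balance.
labels = [0] * 25 + [1] * 75
dist = class_distribution(labels)
print(dist)
```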

---

**Data preparation**

In [1]:
# Load dataset info #
import automation_script
import pandas as pd
import numpy as np
from os import path

dataset_name = "UCI Abalone"
dataset_info = automation_script.get_url(dataset_name)

In [2]:
# Data preparation #

# dataset_info was loaded in the previous cell; reuse it here.

names = ["sex", "length", "diameter", "height", "whole weight",
        "shucked weight", "viscera weight", "shell weight", "rings"]
# Use the local copy if it exists; otherwise fall back to the remote URL.
local_path = "../data/abalone.data.csv"
url = local_path if path.exists(local_path) else dataset_info['url']
data = pd.read_csv(url, delimiter=",", header=None, names=names, index_col=False)
data.head()

# Check for columns that contain missing values #
col_names = data.columns

num_data = data.shape[0]
for c in col_names:
    # isnull() flags NaN (how read_csv marks empty fields) as well as None
    num_missing = data[c].isnull().sum()
    if num_missing > 0:
        print(c)
        print(num_missing)
        print("{0:.2f}%".format(float(num_missing) / num_data * 100))
        print("\n")

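# Convert categorical fields to integer codes #
categorical_col = ['sex']
for col in categorical_col:
    # return_inverse gives, for each row, the index of its value in the sorted unique values
    _, codes = np.unique(data[col], return_inverse=True)
    data[col] = codes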

    
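# Keep only rows with 9 or 10 rings and relabel them as classes 0 and 1 #
# .copy() avoids pandas' SettingWithCopyWarning on the assignment below
data = data[data['rings'].isin([9, 10])].copy()
data['rings'] = data['rings'].map({9: 0, 10: 1})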


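# NOTE: names[:7] stops at "viscera weight" and leaves out "shell weight";
# names[:8] would include every non-target column.
feature_list = names[:7]
X = data.loc[:, feature_list]
# Y is a one-column DataFrame; scikit-learn expects a 1-D target and warns about this shape.
Y = data[['rings']]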
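
The categorical conversion above relies on `np.unique(..., return_inverse=True)`, which sorts the distinct values and replaces each entry with its position in that sorted list. A dependency-free sketch of the same mapping is shown here; the helper name `encode_categories` and the sample values are illustrative only.

```python
def encode_categories(values):
    """Map each value to its index in the sorted list of distinct values."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    return categories, [index[v] for v in values]


# The abalone 'sex' column uses the codes M, F and I (infant).
categories, codes = encode_categories(["M", "F", "I", "M", "F"])
print(categories)
print(codes)
```

Integer codes impose an ordering (F < I < M) that has no real meaning, so a one-hot encoding may be worth trying for the neural network.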


---

**Keras Model**

In [3]:
config = {
    'epoch': 200,
    'batch_size': 100,
    'verbose': 0,
    'model_info': {
        'loss':'binary_crossentropy',
        'optimizer':'adam',
        'metrics':['accuracy']
    }
}

keras_score,keras_params = automation_script.get_keras_params(X,Y,dataset_info,config)

Using TensorFlow backend.



acc: 56.17%


---

**Scikit model**

In [4]:
scikit_score, scikit_params = automation_script.get_scikit_params(X,Y)

0.5944584382871536


  y = column_or_1d(y, warn=True)


---

**K-fold validation**

In [5]:
config = {
    'epoch': 500,
    'batch_size': 100,
    'splits':10,
    'model_info': {
        'loss':'binary_crossentropy',
        'optimizer':'adam',
        'metrics':['accuracy']
    }
}

kfold_acc = automation_script.get_kfold(X,Y,config)

acc: 61.65%
acc: 52.63%
acc: 55.64%
acc: 65.41%
acc: 65.91%
acc: 57.58%
acc: 60.61%
acc: 56.82%
acc: 59.09%
acc: 53.44%
58.88% (+/- 4.34%)
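
The summary line can be recomputed from the per-fold accuracies. The sketch below assumes the spread is a population standard deviation (the default of `numpy.std`); that is a guess about what `automation_script.get_kfold` does, not something the notebook confirms. Recomputing from the rounded fold values gives the same mean and a spread that differs from the printed one only in the last digit, presumably because the printed fold values are rounded.

```python
from statistics import mean, pstdev

# Per-fold accuracies copied from the output above (already rounded to 2 decimals).
fold_acc = [61.65, 52.63, 55.64, 65.41, 65.91, 57.58, 60.61, 56.82, 59.09, 53.44]

avg = mean(fold_acc)
spread = pstdev(fold_acc)  # population std (ddof=0); assumed to match the original script
summary = "{0:.2f}% (+/- {1:.2f}%)".format(avg, spread)
print(summary)
```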

   -------------   



<Figure size 640x480 with 1 Axes>

[[284 126]
 [194 190]]

   -------------   

             precision    recall  f1-score   support

    Class 1       0.59      0.69      0.64       410
    Class 2       0.60      0.49      0.54       384

avg / total       0.60      0.60      0.59       794
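
The report figures can be cross-checked against the confusion matrix, and the overall accuracy compared with a majority-class baseline. The sketch assumes rows are true classes and columns are predicted classes (the scikit-learn convention), which matches the support column. With these numbers, always predicting the larger class would already score about 52%, so roughly 60% accuracy is only a modest gain.

```python
# Confusion matrix from the output above; rows = true class, columns = predicted (assumed).
cm = [[284, 126],
      [194, 190]]

total = sum(sum(row) for row in cm)
support = [sum(row) for row in cm]           # number of true samples per class
accuracy = (cm[0][0] + cm[1][1]) / total     # correct predictions / all predictions
baseline = max(support) / total              # always predict the larger class

recall = [cm[i][i] / support[i] for i in range(2)]
precision = [cm[i][i] / (cm[0][i] + cm[1][i]) for i in range(2)]

print("accuracy  {0:.3f}".format(accuracy))
print("baseline  {0:.3f}".format(baseline))
print("precision", [round(p, 2) for p in precision])
print("recall   ", [round(r, 2) for r in recall])
```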



---

#### Write back to Master sheet ####

In [6]:
dataset_info['scikit_params'] = scikit_params
dataset_info['keras_params'] = keras_params
dataset_info['type'] = 'Binary'
accuracy_values = {
    'keras': keras_score,
    'scikit': scikit_score,
    'kfold': kfold_acc
}

automation_script.write_to_mastersheet(dataset_info,X,Y,accuracy_values)

---

# Pending #
1. Why is the performance so low for both methods?
2. Fix ROC curve