## Keras notebook with clustering added

### Version 5, best score prior to 11/30
1. Plot PCA explained variance, pick a number of PCA dimensions to pass to NN.
1. Visualize 2D PCA of this dataset.
1. Conduct elbow and/or silhouette tests of the data reduced to various PCA dimensions.
1. Select an optimal clustering algorithm and hyperparameters and number of PCA dimensions to cluster on, then cluster.
1. Make cluster id a categorical feature and one-hot encode it.
1. Pass PCA dimensions and cluster id to three layer NN.
1. Tune the network & train up the best candidate.
`'units 1': 512, 'activation function 1': 'relu', 'dropout 1': 0.35, 
'units 2': 512, 'activation function 2': 'relu', 'dropout 2': 0.5, 
'units 3': 512, 'activation function 3': 'relu', 'dropout 3': 0.2`
Score: 0.02222
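
A minimal sketch of step 1 above, picking the number of PCA dimensions from the explained variance. It runs on synthetic data rather than the MoA features, and the 90% cutoff and the helper names `explained_variance_ratio` and `n_dims_for` are assumptions for illustration, not what Version 5 actually used.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component."""
    Xc = X - X.mean(axis=0)                    # centre each feature
    s = np.linalg.svd(Xc, compute_uv=False)    # singular values, descending
    var = s ** 2
    return var / var.sum()

def n_dims_for(ratio, threshold=0.90):
    """Smallest number of components whose cumulative ratio reaches threshold."""
    return int(np.searchsorted(np.cumsum(ratio), threshold) + 1)

rng = np.random.default_rng(0)
# Synthetic stand-in: 3 strong latent directions embedded in 10 noisy features.
latent = rng.normal(size=(500, 3)) * np.array([10.0, 5.0, 3.0])
X_demo = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(500, 10))

ratio = explained_variance_ratio(X_demo)
n_keep = n_dims_for(ratio, threshold=0.90)
print(np.round(ratio, 3), n_keep)
```

Plotting `np.cumsum(ratio)` against the component index gives the explained variance curve that step 1 refers to.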

### Version 7
`'units 1': 512, 'activation function 1': 'relu', 'dropout 1': 0.2, 
'units 2': 2048, 'activation function 2': 'relu', 'dropout 2': 0.2, 
'units 3': 512, 'activation function 3': 'elu', 'dropout 3': 0.35`

### Version 8, I mean 9
Cut out the clustering search phase and hardcode the bw30, 2 dims clustering.
Cut out the kerastuner and hardcode the Version 7 architecture.
Add the custom log loss function to evaluate the results since I'm actually finally in danger of running out of submissions for the day.
Score: 0.05464
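
For checking a model offline before spending a submission, the same log loss can be computed on plain arrays. This is a sketch assuming NumPy inputs; `logloss_np` is a hypothetical name, and its default `eps` mirrors the 0.001/0.999 clip used in the Keras `logloss` metric further down.

```python
import numpy as np

def logloss_np(y_true, y_pred, eps=0.001):
    """Mean binary cross-entropy over all samples and targets, with clipping."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_pred)
                          + (1 - y_true) * np.log(1 - y_pred)))

y_true_demo = np.array([[1, 0], [0, 0]])
uninformed = np.full((2, 2), 0.5)          # predict 0.5 everywhere
score_demo = logloss_np(y_true_demo, uninformed)
print(score_demo)
```

Predicting 0.5 everywhere scores ln 2, so any useful model should land well below that.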

### Version 10, 11, 11/30
Put in GF's logloss function and use as metric.
Use VarianceThreshold, QuantileTransformer, and ICA in place of PCA breakdown.
Cannot cluster on ICA'd data, so removed clustering from Version 11.
Score: 0.02092

### Version 12
Cluster using Mahalanobis distance, if practical. Mahalanobis distance obviates the need for pre-scaling the data, so it can be used on the raw data before any transformations, or at any point prior to ICA. Scikit-learn's documentation comments that this is implemented in the Gaussian Mixture clustering algorithms. Code from https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_selection.html is incorporated below.
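
A small check of the scaling claim, as a sketch on synthetic data: rescaling each feature by a very different factor leaves the Mahalanobis distances unchanged, while Euclidean distances change. The helper `mahalanobis_to_mean` and the scale factors are made up for illustration.

```python
import numpy as np

def mahalanobis_to_mean(X):
    """Mahalanobis distance of each row of X from the sample mean."""
    diff = X - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    # d_i = sqrt(diff_i^T Sigma^-1 diff_i), evaluated row by row
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
scales = np.array([1.0, 100.0, 0.01])      # wildly different feature units

d_raw = mahalanobis_to_mean(X_demo)
d_scaled = mahalanobis_to_mean(X_demo * scales)

# Euclidean distance to the mean, for comparison
e_raw = np.linalg.norm(X_demo - X_demo.mean(axis=0), axis=1)
e_scaled = np.linalg.norm((X_demo - X_demo.mean(axis=0)) * scales, axis=1)
print(np.allclose(d_raw, d_scaled), np.allclose(e_raw, e_scaled))
```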

### Version 13
Implement a cross-validation and ensemble prediction routine.
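
One possible shape for that routine, as a sketch only: train one model per fold, keep the out-of-fold predictions for scoring, and average the per-fold test predictions. The names `cv_ensemble_predict`, `fit_predict` and `mean_baseline` are hypothetical, the folds are a plain shuffle (not the `MultilabelStratifiedKFold` commented out in the imports), and the placeholder model stands in for the Keras network.

```python
import numpy as np

def cv_ensemble_predict(X, y, X_test, fit_predict, n_splits=5, seed=0):
    """Train one model per fold; return out-of-fold and fold-averaged test predictions.

    fit_predict(X_tr, y_tr, X_eval_list) is a hypothetical callable that trains
    a fresh model and returns one prediction array per entry of X_eval_list.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_splits)
    oof = np.zeros(y.shape, dtype=float)
    test_pred = np.zeros((len(X_test), y.shape[1]))
    for k, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        val_p, test_p = fit_predict(X[train_idx], y[train_idx],
                                    [X[val_idx], X_test])
        oof[val_idx] = val_p                 # each row is predicted exactly once
        test_pred += test_p / n_splits       # running average over folds
    return oof, test_pred

def mean_baseline(X_tr, y_tr, X_eval_list):
    """Placeholder model: predict each target's frequency in the training fold."""
    freq = y_tr.mean(axis=0)
    return [np.tile(freq, (len(X_e), 1)) for X_e in X_eval_list]

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 4))
y_demo = (rng.random((100, 3)) < 0.2).astype(float)
X_test_demo = rng.normal(size=(10, 4))
oof_demo, test_demo = cv_ensemble_predict(X_demo, y_demo, X_test_demo,
                                          mean_baseline)
print(oof_demo.shape, test_demo.shape)
```

To use it with the network, `mean_baseline` would be replaced by a function that builds, fits and calls `predict` on a fresh Keras model each time.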

In [None]:
import numpy as np
import pandas as pd
import itertools
from scipy import linalg
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, QuantileTransformer
from sklearn.decomposition import FastICA
from sklearn.feature_selection import VarianceThreshold
from sklearn import mixture
import category_encoders as ce
#!pip install iterative-stratification
#from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

In [None]:
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()
sns.set_style('whitegrid')

In [None]:
import tensorflow as tf
import keras
import keras.backend as K
from keras.models import Model
import kerastuner
from keras.layers import Dense, Input, Dropout, BatchNormalization

In [None]:
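def logloss(y_true,y_pred):
    """Mean binary cross-entropy over all samples and targets (Keras metric)."""
    # Clip predictions away from 0 and 1 so the logs stay finite.
    # y_pred = tf.clip_by_value(y_pred,1e-20,1-1e-20)
    y_pred = tf.clip_by_value(y_pred,0.001,0.999)
    return -K.mean(y_true*K.log(y_pred) + (1-y_true)*K.log(1-y_pred))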

In [None]:
tf_df = pd.read_csv('../input/lish-moa/train_features.csv',index_col='sig_id')
tts_df = pd.read_csv('../input/lish-moa/train_targets_scored.csv',index_col='sig_id')
scaler = MinMaxScaler(feature_range=(-0.5,0.5))
tf_df['cp_time']=scaler.fit_transform((np.array(tf_df['cp_time'])).reshape(-1,1))
oenc = ce.ordinal.OrdinalEncoder()
tf_df = oenc.fit_transform(tf_df)
tf_df['cp_type']=tf_df['cp_type']-1
tf_df['cp_dose']=tf_df['cp_dose']-1
tf_df.head()

In [None]:
testf_df = pd.read_csv('../input/lish-moa/test_features.csv',index_col='sig_id')
testf_df['cp_time']=scaler.transform((np.array(testf_df['cp_time'])).reshape(-1,1))
testf_df = oenc.transform(testf_df)
testf_df['cp_type']=testf_df['cp_type']-1
testf_df['cp_dose']=testf_df['cp_dose']-1

In [None]:
totalf_df = pd.concat([tf_df,testf_df],axis=0)

0. Gaussian Mixture selection

In [None]:
X = totalf_df.values
lowest_bic = np.inf  # np.infty was removed in NumPy 2.0
bic = []
n_components_range = range(1, 7)
cv_types = ['spherical', 'tied', 'diag', 'full']
for cv_type in cv_types:
    for n_components in n_components_range:
        # Fit a Gaussian mixture with EM
        gmm = mixture.GaussianMixture(n_components=n_components,
                                      covariance_type=cv_type)
        gmm.fit(X)
        bic.append(gmm.bic(X))
        if bic[-1] < lowest_bic:
            lowest_bic = bic[-1]
            best_gmm = gmm

bic = np.array(bic)
color_iter = itertools.cycle(['navy', 'turquoise', 'cornflowerblue',
                              'darkorange'])
clf = best_gmm
bars = []

In [None]:
# Plot the BIC scores
plt.figure(figsize=(8, 6))
spl = plt.gca()
for i, (cv_type, color) in enumerate(zip(cv_types, color_iter)):
    xpos = np.array(n_components_range) + .2 * (i - 2)
    bars.append(plt.bar(xpos, bic[i * len(n_components_range):
                                  (i + 1) * len(n_components_range)],
                        width=.2, color=color))
plt.xticks(n_components_range)
plt.ylim([bic.min() * 1.01 - .01 * bic.max(), bic.max()])
plt.title('BIC score per model')
xpos = np.mod(bic.argmin(), len(n_components_range)) + .55 +\
    .2 * np.floor(bic.argmin() / len(n_components_range))
plt.text(xpos, bic.min() * 0.97 + .03 * bic.max(), '*', fontsize=14)
spl.set_xlabel('Number of components')
spl.legend([b[0] for b in bars], cv_types)
plt.show()
Y_ = clf.predict(X)

Hot damn, it ran. An unexplained error in the cluster plotting (code deleted) does not prevent me from using the result: 6 clusters, diag covariance. (Wow, that legend. It follows the order of the `cv_types` list, which is why it looks mixed up.)
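
The winning setting can also be read straight off the flat `bic` array instead of the bar chart. A sketch, assuming the same loop order as the selection cell (covariance type outer, number of components inner); `best_gmm_setting` is a hypothetical helper and the BIC values below are made up, with the minimum planted at diag, 6.

```python
import numpy as np

cv_types = ['spherical', 'tied', 'diag', 'full']
n_components_range = range(1, 7)

def best_gmm_setting(bic, cv_types, n_components_range):
    """Map the flat BIC array back to (covariance_type, n_components)."""
    grid = np.asarray(bic).reshape(len(cv_types), len(n_components_range))
    row, col = np.unravel_index(grid.argmin(), grid.shape)
    return cv_types[row], list(n_components_range)[col]

# Made-up BIC values, for illustration only.
bic_demo = np.arange(24, dtype=float) + 100.0
bic_demo[2 * 6 + 5] = 0.0                   # row 2 = 'diag', column 5 = 6 components
best_demo = best_gmm_setting(bic_demo, cv_types, n_components_range)
print(best_demo)
```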

In [None]:
Y_.shape

In [None]:
labels = pd.Series(Y_)
labels.value_counts()

That's tasty and intriguing. I hope it doesn't evaporate on the rerun.

In [None]:
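# np.int was removed in NumPy 1.24; the builtin int is equivalent here.
# On scikit-learn >= 1.2, `sparse` is named `sparse_output`.
ohenc = OneHotEncoder(drop='first',sparse=False,dtype=int)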
cats = ohenc.fit_transform(Y_.reshape(-1,1))
cats[:5,:]

1. VarianceThreshold

In [None]:
weeder = VarianceThreshold(0.95)
# high variance features retained
highvf_arr = weeder.fit_transform(totalf_df.loc[:,'g-0':'c-99'])
highvf_arr.shape

2. QuantileTransformer

In [None]:
leveller = QuantileTransformer(n_quantiles=100,output_distribution='normal')
qhighvf_arr = leveller.fit_transform(highvf_arr)
qhighvf_arr.shape

3. ICA. ICA "super-standardizes" the features to mean 0 and variance well under 1.

In [None]:
strategy = tf.distribute.get_strategy()
print("Number of accelerators: ", strategy.num_replicas_in_sync)

In [None]:
with strategy.scope():
    ica = FastICA(n_components=300,max_iter=500)
    ica_arr = ica.fit_transform(qhighvf_arr)
# ica.mean_ is the per-feature mean of the input that FastICA subtracts before whitening
print('Per-feature mean removed by ICA:',ica.mean_)

4. Visualize first few ICA components.

In [None]:
sns.pairplot(pd.DataFrame(ica_arr[:,0:5],
                          index=range(ica_arr.shape[0]),
                          columns=range(5)))

Apparently FastICA is horrifically stochastic: it starts from a random initialization, so the components differ from run to run unless `random_state` is fixed. The first time through it produced good clusters, but that's disappeared and apparently won't come back.
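
A sketch of pinning the run-to-run variation with `random_state`, on synthetic data standing in for the quantile-transformed features. The seed value and the helper `run_ica` are arbitrary choices for illustration. Seeding only makes a rerun repeat itself; ICA components are still only defined up to sign and ordering.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
# Synthetic stand-in: 3 non-Gaussian sources mixed into 6 observed columns,
# plus a little noise so the data matrix has full rank.
sources = rng.laplace(size=(500, 3))
X_demo = sources @ rng.normal(size=(3, 6)) + 0.01 * rng.normal(size=(500, 6))

def run_ica(X, seed):
    """Fit FastICA with a fixed seed so the random initialization is repeatable."""
    ica = FastICA(n_components=3, max_iter=500, random_state=seed)
    return ica.fit_transform(X)

first = run_ica(X_demo, seed=42)
second = run_ica(X_demo, seed=42)
print(np.allclose(first, second))
```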

5. Bring together the encoded cp features and ICA components to form the prepared feature array.

In [None]:
cp_arr = totalf_df.loc[:,'cp_type':'cp_dose'].values
print(type(cp_arr),cp_arr.shape,ica_arr.shape)

In [None]:
finalf_arr = np.concatenate((cats,cp_arr,ica_arr),axis=1)
finalf_arr.shape

In [None]:
tts_arr = tts_df.values
tts_arr.shape

In [None]:
n_train = tf_df.shape[0]
n_test = testf_df.shape[0]
n_features = finalf_arr.shape[1]
n_targets = tts_arr.shape[1]
assert n_train + n_test == finalf_arr.shape[0]
tf_arr = finalf_arr[:n_train,:].copy()
testf_arr = finalf_arr[-1*n_test:,:].copy()

In [None]:
print(tf_arr.shape,testf_arr.shape,tf_df.shape,testf_df.shape)

8. Get crackalackin'.

In [None]:
with strategy.scope():
    inputs = Input(shape=(n_features,))
    x = Dense(256,activation='elu')(inputs)
    x = Dropout(0.2)(x)
    x = Dense(64,activation='elu')(x)
    x = Dropout(0.2)(x)
    x = Dense(256,activation='elu')(x)
    x = Dropout(0.2)(x)
    outputs = Dense(n_targets,activation='sigmoid')(x)
    model = Model(inputs,outputs)
    model.compile('adam', 'binary_crossentropy', metrics=[logloss])

In [None]:
model.summary()

In [None]:
n_epochs = 40
n_batch = 32
split = 0.2
print('Starting Training')
history = model.fit(tf_arr,tts_arr,validation_split=split,
                    epochs=n_epochs,batch_size=n_batch,verbose=2)
print('Finished Training')

In [None]:
model.evaluate(tf_arr,tts_arr)

In [None]:
with strategy.scope():
    inputs = Input(shape=(n_features,))
    x = Dense(512,activation='elu')(inputs)
    x = Dropout(0.2)(x)
    x = Dense(512,activation='elu')(x)
    x = Dropout(0.2)(x)
    x = Dense(512,activation='elu')(x)
    x = Dropout(0.2)(x)
    outputs = Dense(n_targets,activation='sigmoid')(x)
    model2 = Model(inputs,outputs)
    model2.compile('adam', 'binary_crossentropy', metrics=[logloss])
model2.summary()
print('Starting Training')
history = model2.fit(tf_arr,tts_arr,validation_split=split,
                    epochs=n_epochs,batch_size=n_batch,verbose=2)
print('Finished Training')
model2.evaluate(tf_arr,tts_arr)

In [None]:
tts_pred = model2.predict(testf_arr)
sub_df = pd.DataFrame(tts_pred,index=testf_df.index,columns=tts_df.columns)
sub_df.head()

In [None]:
sub_df.to_csv('/kaggle/working/submission.csv')