# Hardness measures

---
Experimental code
---

__Only for testing ideas and the implementation of the measures__
- - -

__1)__ Take a well-known artificial dataset. I generated the attached overlap dataset.

__2)__ For each instance in the dataset, extract instance hardness meta-features. Attached are two R scripts with some measures (those from the paper plus one more from José Luis's work). José Luis's undergraduate thesis (TG) is also attached; the adaptations he made are described in Section 3.4.

__3)__ Run different classification techniques with 10-fold CV or leave-one-out and extract the log-loss measure for each instance (http://wiki.fast.ai/index.php/Log_Loss#Multi-class_Classification). Suggested techniques:
* Linear SVM
* RBF SVM
* Random Forest
* Gradient Boosting
* MLP neural network with one layer
* Bagging
* Naïve Bayes
* Logistic regression
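
The per-instance log-loss under cross-validation described in step 3 could be sketched as follows. This is a minimal illustration, not the notebook's implementation: the helper `instance_log_loss`, the synthetic data from `make_classification`, and the `algo_*` dictionary keys are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.naive_bayes import GaussianNB


def instance_log_loss(clf, X, y, n_splits=10, eps=1e-15, seed=0):
    """Per-instance log-loss computed from out-of-fold predicted probabilities."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # Each row is predicted by a model that did not see that row during training.
    proba = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")
    proba = np.clip(proba, eps, 1 - eps)
    # Columns of `proba` follow the sorted class labels, the same order np.unique uses.
    _, y_idx = np.unique(y, return_inverse=True)
    return -np.log(proba[np.arange(len(y)), y_idx])


# Synthetic two-class data standing in for the overlap dataset (assumption).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
losses = {
    "algo_logistic_regression": instance_log_loss(LogisticRegression(), X_demo, y_demo),
    "algo_naive_bayes": instance_log_loss(GaussianNB(), X_demo, y_demo),
}
```

Any classifier exposing `predict_proba` can be passed in the same way; `SVC` needs `probability=True` for that.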

__4)__ Build a meta-dataset in which each row is one of the instances of the overlap dataset and each column is an instance hardness measure, followed by columns holding the log-loss values of each technique. The columns must follow a naming convention so that the file can be used in Matilda later (there is an example meta-dataset for the travelling salesman problem at https://matilda.unimelb.edu.au/matilda/matildadata/graph_coloring_problem/metadata/metadata.csv):

_The CSV file must contain only 4 types of columns listed below. Column headers should strictly follow the required naming convention._

i. _instances (instance identifier - We expect instance identifier to be of type "String". This column is mandatory)_

ii. _Source (instance source - This column is optional)_

iii. _feature_name (The keyword "feature_" concatenated with feature name. For instance, if feature name is "density", header name should be mentioned as "feature_density". If name consists of more than one word, each word should be separated by "\_" (spaces are not allowed). You can add one or more features. This column is required either in "Custom Problem" or if you want to add more features in the analysis of a library problem.)_

iv. _algo_name (The keyword "algo_" concatenated with algorithm name. For instance, if algorithm name is "Greedy", column header should be "algo_greedy". If name consists of more than one word, each word should be separated by "\_" (spaces are not allowed). You can add the performance of more than one algorithm in the same csv. This column is required either in "Custom Problem" or if you want to add more algorithms in the analysis of a library problem.)_
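
A table following this naming convention could be assembled roughly as below. The helper `build_metadata`, the `inst_<i>` identifiers, and the numeric values are placeholders invented for the illustration; only the `instances`, `feature_` and `algo_` header rules come from the quoted text.

```python
import numpy as np
import pandas as pd


def build_metadata(features: dict, performances: dict) -> pd.DataFrame:
    """Assemble a table with `instances`, `feature_*` and `algo_*` columns."""

    def header(prefix, name):
        # Spaces are not allowed in headers; words are joined with "_".
        return prefix + "_".join(name.split())

    columns = {header("feature_", k): v for k, v in features.items()}
    columns.update({header("algo_", k): v for k, v in performances.items()})
    df = pd.DataFrame(columns)
    # The instance identifier is expected to be a string.
    df.insert(0, "instances", [f"inst_{i}" for i in range(len(df))])
    return df


# Placeholder values, only to show the resulting layout.
meta = build_metadata(
    features={
        "kDN": np.array([0.0, 0.4, 1.0]),
        "class likelihood": np.array([0.9, 0.6, 0.1]),
    },
    performances={"svm linear": np.array([0.05, 0.51, 2.30])},
)
print(list(meta.columns))
# ['instances', 'feature_kDN', 'feature_class_likelihood', 'algo_svm_linear']
```

Writing the result with `meta.to_csv(..., index=False)` keeps the `instances` column as a regular column rather than as an unnamed index.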


__5)__ With the meta-dataset ready, run the Matilda analysis. The meta-features can also be pre-selected beforehand, but we can discuss how to do that later.

In [None]:
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas_bokeh

In [None]:
from sklearn import svm
from sklearn import tree
from sklearn.metrics import log_loss
from sklearn.neighbors import NearestNeighbors, KernelDensity
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, BaggingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

In [None]:
sns.set()

plt.rcParams['figure.figsize'] = (16, 10)

pandas_bokeh.output_notebook()

## Toy dataset

In [None]:
data_path = os.path.realpath("../data/")

metadata_path = os.path.join(data_path, "metadata.csv")
overlap_path = os.path.join(data_path, "overlap.csv")

In [None]:
df_metadata = pd.read_csv(metadata_path, index_col='instances')
df_overlap = pd.read_csv(overlap_path)

In [None]:
_=sns.scatterplot(data=df_overlap, x='V1', y='V2', hue='class', legend="full", palette='coolwarm')

In [None]:
df_metadata

In [None]:
df_overlap

In [None]:
X = df_overlap[['V1', 'V2']]
y = df_overlap['class']

In [None]:
import pyhard

In [None]:
m = pyhard.Measures(df_overlap, labels_col='class')

In [None]:
m.build_metadata()

In [None]:
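from sklearn.preprocessing import OneHotEncoder

def logloss(y_true: np.ndarray, y_pred: np.ndarray, eps=1e-15):
    """Per-instance log-loss; columns of y_pred must follow the sorted class labels."""
    enc = OneHotEncoder()
    y_true = enc.fit_transform(y_true.reshape(-1, 1)).toarray()

    # Clip probabilities to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1-eps)
    return -np.sum(y_true * np.log(y_pred), axis=1)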

In [None]:
from sklearn.svm import SVC

In [None]:
clf = SVC(probability=True)

In [None]:
clf.fit(X,y)

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(y.values.reshape(-1, 1))
v=enc.transform(y.values.reshape(-1, 1)).toarray()

In [None]:
a=np.array([[1,2,3]])
a.reshape(-1, 1)

In [None]:
logloss(y.values, clf.predict_proba(X))

In [None]:
clf.classes_

## k-Disagreeing Neighbors (kDN)

In [None]:
k = 5
nbrs = NearestNeighbors(n_neighbors=k+1, algorithm='auto', metric='euclidean').fit(X)

In [None]:
distances, indices = nbrs.kneighbors(X)

In [None]:
indices.shape

In [None]:
kDN = []
for i in range(0, len(df_overlap)):
    # indices[i] holds the instance itself first, followed by its k nearest neighbors
    v = df_overlap.loc[indices[i]]['class'].values
    kDN.append(np.sum(v[1:] != v[0]) / k)
df_overlap['kDN'] = kDN
df_overlap
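
The loop above can also be written without iterating over rows. The sketch below is one possible vectorized form; the function name `k_disagreeing_neighbors` and the small made-up dataset are assumptions for illustration, and it relies on each point being returned as its own first neighbor (which may not hold when there are duplicated points).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def k_disagreeing_neighbors(X, y, k=5):
    """Fraction of the k nearest neighbors whose class differs from the instance's own."""
    X = np.asarray(X)
    y = np.asarray(y)
    # Ask for k+1 neighbors because each point is returned as its own nearest neighbor.
    nbrs = NearestNeighbors(n_neighbors=k + 1, metric="euclidean").fit(X)
    _, indices = nbrs.kneighbors(X)
    neighbor_labels = y[indices[:, 1:]]
    return np.mean(neighbor_labels != y[:, None], axis=1)


# Two separated groups plus one class-2 point placed inside the class-1 group (made-up data).
X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [0.05, 0.05]])
y_demo = np.array([1, 1, 1, 1, 2, 2, 2, 2])
kdn_demo = k_disagreeing_neighbors(X_demo, y_demo, k=3)
```

In this toy data the last point sits among instances of the other class, so all three of its neighbors disagree with it and its kDN is 1.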

In [None]:
sns.scatterplot(x=df_metadata['feature_KDN'].values, y=df_overlap['kDN'].values)

In [None]:
df_overlap.loc[indices[4]]['class']

## Disjunct Size (DS)

In [None]:
clf = tree.DecisionTreeClassifier(criterion='gini', min_samples_split=2) # min_samples_leaf=1
clf = clf.fit(X, y)

In [None]:
# import graphviz 

# dot_data = tree.export_graphviz(clf, 
#                                 out_file=None, 
#                                 feature_names=['V1', 'V2'],
#                                 class_names=['1', '2'],
#                                 filled=True, rounded=True,  
#                                 special_characters=True) 
# graph = graphviz.Source(dot_data)
# graph

In [None]:
df_overlap['leaf_id'] = clf.apply(X)

In [None]:
# Disjunct size before normalization: number of other instances that share the same leaf
df = df_overlap.groupby('leaf_id').count().iloc[:,0].to_frame('count').subtract(1)

In [None]:
df_overlap = df_overlap.join(df, on='leaf_id')

In [None]:
df_overlap['DS'] = df_overlap['count'].divide(df_overlap['count'].max())

In [None]:
df_overlap

In [None]:
df_metadata

## Disjunct Class Percentage

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
parameters = {'ccp_alpha': np.linspace(0.001, 0.1, num=100)}
dtc = tree.DecisionTreeClassifier(criterion='gini')
clf_prune = GridSearchCV(dtc, parameters)
clf_prune = clf_prune.fit(X, y)

In [None]:
clf_prune.best_params_

In [None]:
dtc = tree.DecisionTreeClassifier(criterion='gini', ccp_alpha=clf_prune.best_params_['ccp_alpha'])
dtc = dtc.fit(X,y)

In [None]:
df_overlap['leaf_id'] = dtc.apply(X)

In [None]:
%%time

df3 = df_overlap.rename(columns={'class':'y'})
dcp = []
for index, row in df3.iterrows():
    df_leaf = df3[df3['leaf_id'] == row['leaf_id']]
    dcp.append(len(df_leaf[df_leaf['y'] == row['y']]) / len(df_leaf))
    
df3['DCP'] = dcp
df3
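
Since the row-by-row loop filters the whole frame once per instance, a grouped computation may be faster on larger datasets. The following is a sketch under the assumption that DCP is the share of same-class instances within each leaf, as the loop computes; `disjunct_class_percentage` and the toy series are hypothetical names and data.

```python
import pandas as pd


def disjunct_class_percentage(leaf_id: pd.Series, y: pd.Series) -> pd.Series:
    """Share of instances in the same leaf that have the same class as each instance."""
    df = pd.DataFrame({"leaf_id": leaf_id.values, "y": y.values}, index=y.index)
    # Instances per (leaf, class) pair and per leaf, aligned back to each row.
    same_class = df.groupby(["leaf_id", "y"])["y"].transform("count")
    in_leaf = df.groupby("leaf_id")["y"].transform("count")
    return same_class / in_leaf


# Toy leaf assignment: leaf 3 is mixed, leaf 4 is pure.
leaf_demo = pd.Series([3, 3, 3, 4, 4])
y_demo = pd.Series([1, 1, 2, 2, 2])
dcp_demo = disjunct_class_percentage(leaf_demo, y_demo)
```

With the notebook's variables, this would be called as `disjunct_class_percentage(df_overlap['leaf_id'], df_overlap['class'])`.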

In [None]:
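# Depth of the leaf reached by each instance in the pruned tree (nodes on the decision path minus the root)
TP = X.apply(lambda x: dtc.decision_path([x]).sum()-1, axis=1, raw=True).values

In [None]:
sns.scatterplot(x=df_metadata['feature_TD'].values, y=TP)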

## Class Likelihood

In [None]:
X['V1']

In [None]:
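# sns.distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the closest replacement
sns.histplot(X['V2'], kde=True, stat='density')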

In [None]:
kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(X)

In [None]:
from sklearn.naive_bayes import GaussianNB

In [None]:
nb = GaussianNB(priors=[0.5, 0.5])
nb.fit(X, y)

In [None]:
prob = nb.predict_proba(X)

In [None]:
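lab = y.values
# Column of `prob` that holds each instance's true class; nb.classes_ is sorted,
# so this also works when the labels are not exactly 1..n
col = np.searchsorted(nb.classes_, lab)
CL = [prob[i, col[i]] for i in range(len(lab))]

In [None]:
# Class likelihood difference: true-class probability minus the best competing class
CLD = [prob[i, col[i]] - np.delete(prob[i, :], col[i]).max() for i in range(len(lab))]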

In [None]:
np.delete(np.array([1,2,3,4,5]), 4).max()

In [None]:
sns.scatterplot(x=df_metadata['feature_CLD'].values, y=CLD)
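
The class likelihood difference computed above can be wrapped in a single function. This is only a sketch: `class_likelihood_difference` is a hypothetical helper, the uniform priors generalize the `priors=[0.5, 0.5]` used in the notebook to any number of classes, and the one-dimensional data is made up.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB


def class_likelihood_difference(X, y):
    """True-class probability minus the largest probability among the other classes."""
    X, y = np.asarray(X), np.asarray(y)
    n_classes = len(np.unique(y))
    # Uniform priors, mirroring priors=[0.5, 0.5] in the two-class case.
    nb = GaussianNB(priors=np.full(n_classes, 1.0 / n_classes)).fit(X, y)
    prob = nb.predict_proba(X)
    rows = np.arange(len(y))
    col = np.searchsorted(nb.classes_, y)  # column index of each instance's true class
    true_prob = prob[rows, col]
    rivals = prob.copy()
    rivals[rows, col] = -np.inf  # mask the true class before taking the maximum
    return true_prob - rivals.max(axis=1)


# Two well-separated one-dimensional groups (made-up data).
X_demo = np.array([[0.0], [0.2], [0.4], [5.0], [5.2], [5.4]])
y_demo = np.array([1, 1, 1, 2, 2, 2])
cld_demo = class_likelihood_difference(X_demo, y_demo)
```

Values close to 1 indicate instances that the model assigns confidently to their own class; negative values indicate instances for which another class is more likely.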

## Minority Value

In [None]:
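# Class sizes relative to the majority class
MVc = df_overlap.groupby('class').count().iloc[:,0]
MVc=MVc.divide(MVc.max())

# Minority value: 0 for the majority class, approaching 1 for rare classes
1-y.apply(lambda c: MVc[c]).values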

## N1

In [None]:
import gower

In [None]:
dist_matrix = gower.gower_matrix(X)

In [None]:
from scipy.sparse.csgraph import minimum_spanning_tree

In [None]:
Tcsr = minimum_spanning_tree(dist_matrix)

In [None]:
mst = Tcsr.toarray()

In [None]:
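# Mark missing edges with inf so that only the MST edges remain finite
mst = np.where(mst>0, mst, np.inf)
c = y.values
N1 = np.zeros(c.shape)
for i in range(len(c)):
    # MST neighbors of i; each edge is stored in one direction only, so check row and column
    idx = np.argwhere(np.minimum(mst[i,:], mst[:,i]) < np.inf)
    assert len(idx) > 0
    # Number of MST neighbors whose class differs from that of instance i
    N1[i] = np.sum(c[idx[:,0]] != c[i])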

In [None]:
N1[N1>0]
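
The same count can be obtained directly from the sparse MST edge list, without building the dense matrix. The sketch below uses Euclidean distances instead of the Gower distances used in the notebook (an assumption made to keep the example dependency-free), and `mst_disagreeing_neighbors` is a hypothetical name. As with the dense version, zero distances between duplicated points are treated by SciPy as missing edges.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform


def mst_disagreeing_neighbors(X, y):
    """Count, for each instance, the MST neighbors that belong to a different class."""
    y = np.asarray(y)
    # Euclidean distances are an assumption here; the notebook uses Gower distances.
    dist = squareform(pdist(np.asarray(X), metric="euclidean"))
    edges = minimum_spanning_tree(dist).tocoo()
    differ = y[edges.row] != y[edges.col]
    counts = np.zeros(len(y))
    # Each MST edge is stored once, so credit both of its endpoints.
    np.add.at(counts, edges.row[differ], 1)
    np.add.at(counts, edges.col[differ], 1)
    return counts


# Four points on a line; only the middle edge connects different classes.
X_demo = np.array([[0.0], [1.0], [2.5], [3.5]])
y_demo = np.array([1, 1, 2, 2])
n1_demo = mst_disagreeing_neighbors(X_demo, y_demo)
```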

## N2

In [None]:
k = len(X)
nbrs = NearestNeighbors(n_neighbors=k, algorithm='auto', metric='euclidean').fit(X)
distances, indices = nbrs.kneighbors(X)

In [None]:
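# Overrides the Euclidean neighbors computed above with an ordering based on the Gower distance matrix
indices = np.argsort(dist_matrix, axis=1)
distances = np.sort(dist_matrix, axis=1)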

In [None]:
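N2 = np.zeros(y.values.shape)
for i, value in y.items():
    # Labels of all instances, ordered from nearest to farthest from instance i
    nn = y.loc[indices[i,:]]
    intra = nn.eq(value).values
    extra = nn.ne(value).values
    assert np.all(np.diff(distances[i, intra]) >= 0)
    assert np.all(np.diff(distances[i, extra]) >= 0)
    # Nearest same-class distance (position 0 is the instance itself) over nearest other-class distance
    N2[i] = distances[i, intra][1]/distances[i, extra][0]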

In [None]:
N2
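
The ratio computed in the loop can also be obtained with masked minima over the distance matrix. This is a sketch under assumptions: it uses Euclidean distances rather than the Gower matrix, the name `intra_extra_ratio` is made up, and every class is assumed to have at least two instances.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def intra_extra_ratio(X, y):
    """Nearest same-class distance divided by nearest other-class distance, per instance."""
    y = np.asarray(y)
    # Euclidean distances are an assumption here; the notebook uses Gower distances.
    dist = squareform(pdist(np.asarray(X), metric="euclidean"))
    same = y[:, None] == y[None, :]
    np.fill_diagonal(same, False)  # an instance is not its own intra-class neighbor
    other = y[:, None] != y[None, :]
    intra = np.where(same, dist, np.inf).min(axis=1)
    extra = np.where(other, dist, np.inf).min(axis=1)
    return intra / extra


# Four points on a line, two per class.
X_demo = np.array([[0.0], [1.0], [2.5], [3.5]])
y_demo = np.array([1, 1, 2, 2])
n2_demo = intra_extra_ratio(X_demo, y_demo)
```

Larger ratios indicate instances whose nearest other-class neighbor is close relative to their nearest same-class neighbor.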