# CLASSIFICATION - Training and testing

**NAMES**:
* Andrea Marcela Castrillon Buitrago
* Yeison Fernando Villamil Franco

As indicated in the training notebook, because we lack in-depth knowledge of how the gene expression and cell viability responses behave, we decided to keep all the values. The classification problem involves an imbalanced dataset.

As a first iteration, two proteins (output variables) are selected to evaluate the models: one with few activations (values of 1) and the one with the largest number of activations. The models are:

* Multinomial Naive Bayes
* Logistic regression
* Random Forest
* Support vector machines (SVM)
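
The choice of a rarely activated label and the most activated one can be made by counting the 1s in each target column. The sketch below illustrates this on a small hypothetical table (`toy_targets` and its column names are invented for the example, not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical toy target table: one row per sample, one binary column per label.
n = 1000
toy_targets = pd.DataFrame({
    "sig_id": [f"id_{i}" for i in range(n)],
    "rare_label": np.repeat([1, 0], [2, n - 2]),
    "medium_label": np.repeat([1, 0], [20, n - 20]),
    "common_label": np.repeat([1, 0], [100, n - 100]),
})

# Count activations (1s) per label, ignoring the identifier column.
activations = toy_targets.drop(columns="sig_id").sum().sort_values()
positive_rate = activations / len(toy_targets)

least_active = activations.index[0]   # label with the fewest activations
most_active = activations.index[-1]   # label with the most activations

print(pd.DataFrame({"activations": activations, "positive_rate": positive_rate}))
```

On the real data the same idea would apply to the target table after dropping its identifier column.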

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib
import scipy.stats as stats

In [None]:
train_features = pd.read_csv('train_features.csv')
train_target = pd.read_csv('train_targets_scored.csv')
test_features = pd.read_csv('test_features.csv')

In [None]:
train_features.shape, train_target.shape, test_features.shape

### Variable selection

In [None]:
train_features.head(5)

In [None]:
train_target.head(5)

We filter the data so that training uses only samples perturbed with chemical compounds, which are marked with `trt_cp` in the `cp_type` column. Treatment times and doses are kept as they are, without further filtering.

First, the features and targets are joined (here with `pd.concat` along the columns) so that the same rows are removed from both X_train and y_train.
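
A column-wise `pd.concat` relies on both tables having their rows in the same order. As an alternative sketch, assuming both tables share a `sig_id` key column, a key-based merge aligns the rows explicitly; the toy frames and the `label_x` column below are hypothetical.

```python
import pandas as pd

# Hypothetical miniature versions of the feature and target tables.
toy_features = pd.DataFrame({
    "sig_id": ["id_a", "id_b", "id_c", "id_d"],
    "cp_type": ["trt_cp", "ctl_vehicle", "trt_cp", "trt_cp"],
    "g-0": [0.5, -1.2, 0.3, 2.1],
})
toy_targets = pd.DataFrame({
    "sig_id": ["id_d", "id_c", "id_b", "id_a"],  # deliberately in a different order
    "label_x": [1, 0, 0, 0],
})

# Align rows by key; validate= raises if a sig_id is duplicated in either table.
merged = toy_features.merge(toy_targets, on="sig_id", how="inner", validate="one_to_one")

# Filtering the merged table removes the same rows from features and targets.
treated = merged[merged["cp_type"] == "trt_cp"]
X_toy = treated[["g-0"]]
y_toy = treated[["label_x"]]
print(treated)
```

With a merge, the identifier appears only once in the result, so the feature and target columns can be selected by name rather than by position.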

In [None]:
data_train = pd.concat([train_features, train_target], axis = 1)
data_train.head(5)

In [None]:
data_train = data_train[data_train['cp_type'] == 'trt_cp']
# data_train.shape

In [None]:
X_prob = test_features[test_features['cp_type'] == 'trt_cp']
# X_prob.shape

In [None]:
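# Features: skip the four leading columns (identifier and treatment metadata)
X = data_train.iloc[:,4:876]
# Targets: start at 877, skipping column 876 (assumed to be the identifier column duplicated by the concat)
y = data_train.iloc[:,877:]
# Same feature columns for the held-out test file
X_prob_test = X_prob.iloc[:,4:]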

In [None]:
y_t = y['5-alpha_reductase_inhibitor']
y_t2 = y['nfkb_inhibitor']

In [None]:
X.shape, y.shape, X_prob_test.shape

### Standard Scaler

With the dataframes filtered, one label at a time is taken to test simple classification models. From the training `y`, two labels are used: a first test with a label that has few values of 1, and then one with more values of 1.

A split stratified on the variable `y` is generated to obtain the train and test sets.
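
To illustrate why stratification matters with rare labels, the following sketch splits a synthetic dataset with 2% positives (the data and names are made up for the example) and checks how many positives land in each part.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 samples, 20 of them positive (2%).
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 5))
y_toy = np.repeat([1, 0], [20, 980])

# stratify=y_toy keeps the class proportions equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print("positives in train:", y_tr.sum(), "of", len(y_tr))
print("positives in test: ", y_te.sum(), "of", len(y_te))
```

Without `stratify`, a label with very few positives could end up with almost none of them in the test set, which would make the test metrics uninformative.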

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y_t, random_state = 42, test_size=0.3, stratify = y_t)

In [None]:
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y_t2, random_state = 42, test_size=0.3, stratify = y_t2)

### Analysis for the output `5-alpha_reductase_inhibitor`

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
y_train.hist()
# y_test.hist()

In [None]:
y_train.value_counts()

In [None]:
y_test.hist()

In [None]:
y_test.value_counts()

In [None]:
# train_target[train_target.columns[137]].value_counts()

### Analysis for the output `nfkb_inhibitor`

In [None]:
X_train2.shape, X_test2.shape, y_train2.shape, y_test2.shape

In [None]:
y_train2.hist()

In [None]:
y_train2.value_counts()

In [None]:
y_test2.hist()

In [None]:
y_test2.value_counts()

# Classification models

## Naive-Bayes

### Label --> `5-alpha_reductase_inhibitor`

*For feature scaling (normalization), the multinomial Naive Bayes model does not accept negative values, so `MinMaxScaler` is used to keep the input variables non-negative.*
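
The constraint can be checked on a small synthetic example (the arrays below are made up for illustration): fitting `MultinomialNB` on data with negative entries raises a `ValueError`, while the same data rescaled with `MinMaxScaler` is accepted.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler

# Synthetic features drawn from a normal distribution, so some are negative.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] > 0).astype(int)

# Fitting on the raw features is rejected because of the negative values.
try:
    MultinomialNB().fit(X_toy, y_toy)
    raised = False
except ValueError as err:
    raised = True
    print("MultinomialNB rejected the raw features:", err)

# After min-max scaling, every training feature lies in [0, 1].
scaler_toy = MinMaxScaler().fit(X_toy)
X_scaled = scaler_toy.transform(X_toy)
model_toy = MultinomialNB().fit(X_scaled, y_toy)
print("min after scaling:", X_scaled.min())
```

Note that the scaler is fitted on the training data only, so new samples outside the training range can still map outside [0, 1].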

In [None]:
y_entren = np.array(y_train)
y_prueba= np.array(y_test)

In [None]:
# X_trainnp = X_train.iloc[:,:].to_numpy()
# X_testnp = X_test.iloc[:,:].to_numpy()
# y_trainnp = y_train.iloc[:,:].to_numpy()
# y_testnp = y_test.iloc[:,:].to_numpy()

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)
X_train_n = scaler.transform(X_train)
X_test_n = scaler.transform(X_test)

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler1 = MinMaxScaler().fit(X_train)
X_train_max = scaler1.transform(X_train)
X_test_max = scaler1.transform(X_test)
X_test_prob = scaler1.transform(X_prob_test)

In [None]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB(alpha = 1)

In [None]:
from sklearn.model_selection import KFold, StratifiedKFold, StratifiedShuffleSplit, ShuffleSplit

In [None]:
nfolds = 5
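# Fixed seed so the folds are reproducible across runs (42 is an arbitrary choice)
cv = StratifiedShuffleSplit(n_splits = nfolds, random_state = 42)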
# pred_prob = np.zeros((X_prob_test.shape[0], y_trainnp.shape[1]))
# pred_train = np.zeros((X_train_max.shape[0], y_trainnp.shape[1]))

In [None]:
from sklearn.metrics import accuracy_score, f1_score, balanced_accuracy_score, plot_confusion_matrix

# Keep the score of every fold so the summary is not just the last split
bac_train, bac_val = [], []
for fn, (train_ind, val_ind) in enumerate(cv.split(X_train_max, y_entren)):
    print('Starting fold', fn)
    X_tr, X_val = X_train_max[train_ind], X_train_max[val_ind]
    y_tr, y_val = y_entren[train_ind], y_entren[val_ind]
    clf.fit(X_tr, y_tr)
    
    y_pred = clf.predict(X_tr)
    y_vali = clf.predict(X_val)
    
    bac_train.append(balanced_accuracy_score(np.ravel(y_tr), y_pred))
    bac_val.append(balanced_accuracy_score(np.ravel(y_val), y_vali))
    
print('Training BAC (mean over folds) =', np.mean(bac_train))
print('Validation BAC (mean over folds) =', np.mean(bac_val))
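
The manual fold loop above can also be expressed with `cross_validate`, which returns one score per fold. This is only a sketch on synthetic data (all names ending in `_toy` are hypothetical); it places the scaler inside a pipeline so that it is refitted on each training fold, which differs from the notebook, where the scaler is fitted once before the loop.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, moderately imbalanced binary problem.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 10))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=500) > 1.0).astype(int)

# Scaling and model in one estimator, so each fold is scaled independently.
pipe_toy = make_pipeline(MinMaxScaler(), MultinomialNB(alpha=1))
cv_toy = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

scores = cross_validate(
    pipe_toy, X_toy, y_toy,
    cv=cv_toy, scoring="balanced_accuracy", return_train_score=True,
)

print("Training BAC per fold:  ", scores["train_score"])
print("Validation BAC per fold:", scores["test_score"])
print("Mean validation BAC:    ", scores["test_score"].mean())
```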

In [None]:
from sklearn.metrics import accuracy_score, f1_score, balanced_accuracy_score, plot_confusion_matrix
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder().fit(np.unique(y['5-alpha_reductase_inhibitor']))

pred_test = clf.predict(X_test_max)
# y_pre = pred_test.reshape(-1)
# y_test_n = y_testnp.reshape(-1)

print(f"Accuracy = {accuracy_score(y_prueba,pred_test)}")
print(f"Balanced Accuracy = {balanced_accuracy_score(y_prueba,pred_test)}")

# F1, precision and recall require a convention for which class is the positive one (1)
print(f"F1 = {f1_score(le.transform(y_prueba),le.transform(pred_test))}")

disp = plot_confusion_matrix(clf, X_test_max, y_prueba, display_labels=np.unique(y['5-alpha_reductase_inhibitor']),
                             cmap=plt.cm.Blues, 
                             normalize='true')
disp.ax_.set_title('Normalized CM - MNB(5-alpha_reductase_inhibitor)')
plt.show()
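
To make the relation between these metrics concrete, here is a small worked example with hand-made predictions (the arrays are invented for illustration). It shows that plain accuracy can look high on imbalanced data while balanced accuracy, which is the mean of the diagonal of the row-normalized confusion matrix, stays low. As a side note, newer scikit-learn releases provide `ConfusionMatrixDisplay.from_estimator` in place of `plot_confusion_matrix`.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             confusion_matrix, f1_score)

# 95 negatives and 5 positives; the classifier finds only 2 of the 5 positives
# and raises 2 false alarms among the negatives.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.array([0] * 93 + [1] * 2 + [1] * 2 + [0] * 3)

acc = accuracy_score(y_true, y_pred)
bac = balanced_accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)  # positive class is 1 by default

# Row-normalized confusion matrix: each row sums to 1, the diagonal holds per-class recall.
cm = confusion_matrix(y_true, y_pred, normalize="true")
bac_from_cm = np.diag(cm).mean()

print("Accuracy          =", acc)
print("Balanced accuracy =", bac)
print("F1                =", f1)
print("Per-class recall  =", np.diag(cm))
```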

### Label --> `nfkb_inhibitor`

In [None]:
y_entren2 = np.array(y_train2)
y_prueba2= np.array(y_test2)

In [None]:
scaler2 = StandardScaler().fit(X_train2)
X_train_n2 = scaler2.transform(X_train2)
X_test_n2 = scaler2.transform(X_test2)

In [None]:
scaler_ = MinMaxScaler().fit(X_train2)
X_train_max2 = scaler_.transform(X_train2)
X_test_max2 = scaler_.transform(X_test2)
X_test_prob = scaler_.transform(X_prob_test)

In [None]:
from sklearn.metrics import accuracy_score, f1_score, balanced_accuracy_score, plot_confusion_matrix

# Keep the score of every fold so the summary is not just the last split
bac_train, bac_val = [], []
for fn, (train_ind, val_ind) in enumerate(cv.split(X_train_max2, y_entren2)):
    print('Starting fold', fn)
    X_tr, X_val = X_train_max2[train_ind], X_train_max2[val_ind]
    y_tr, y_val = y_entren2[train_ind], y_entren2[val_ind]
    clf.fit(X_tr, y_tr)
    
    y_pred = clf.predict(X_tr)
    y_vali = clf.predict(X_val)
    
    bac_train.append(balanced_accuracy_score(np.ravel(y_tr), y_pred))
    bac_val.append(balanced_accuracy_score(np.ravel(y_val), y_vali))
    
print('Training BAC (mean over folds) =', np.mean(bac_train))
print('Validation BAC (mean over folds) =', np.mean(bac_val))

In [None]:
le = LabelEncoder().fit(np.unique(y['nfkb_inhibitor']))

pred_test = clf.predict(X_test_max2)
# y_pre = pred_test.reshape(-1)
# y_test_n = y_testnp.reshape(-1)

print(f"Accuracy = {accuracy_score(y_prueba2,pred_test)}")
print(f"Balanced Accuracy = {balanced_accuracy_score(y_prueba2,pred_test)}")

# F1, precision and recall require a convention for which class is the positive one (1)
print(f"F1 = {f1_score(le.transform(y_prueba2),le.transform(pred_test))}")

disp = plot_confusion_matrix(clf, X_test_max2, y_prueba2, display_labels=np.unique(y['nfkb_inhibitor']),
                             cmap=plt.cm.Blues, 
                             normalize='true')
disp.ax_.set_title('Normalized CM - MNB(nfkb_inhibitor)')
plt.show()

## Logistic regression

### Label --> `5-alpha_reductase_inhibitor`

In [None]:
from sklearn.linear_model import LogisticRegression
clf2 = LogisticRegression(random_state = 17, class_weight = 'balanced', max_iter=1000)
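
The `class_weight = 'balanced'` option weights each class inversely to its frequency, using `n_samples / (n_classes * count_per_class)`. The sketch below reproduces that arithmetic on a hypothetical label vector with 90 negatives and 10 positives.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 90 zeros and 10 ones.
y_toy = np.repeat([0, 1], [90, 10])
classes = np.unique(y_toy)

# Weights as computed by scikit-learn for class_weight='balanced'.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# Same formula written out by hand.
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

print(dict(zip(classes.tolist(), weights.tolist())))
```

Here each positive sample counts nine times as much as a negative one in the loss, which compensates for the 9:1 imbalance.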

In [None]:
# cv1 = StratifiedShuffleSplit(n_splits = 6)

# Keep the score of every fold so the summary is not just the last split
bac_train, bac_val = [], []
for fn, (train_ind, val_ind) in enumerate(cv.split(X_train_n, y_entren)):
    print('Starting fold', fn)
    X_tr, X_val = X_train_n[train_ind], X_train_n[val_ind]
    y_tr, y_val = y_entren[train_ind], y_entren[val_ind]
    clf2.fit(X_tr, y_tr)
    
    y_pred = clf2.predict(X_tr)
    y_vali = clf2.predict(X_val)
    
    bac_train.append(balanced_accuracy_score(y_tr, y_pred))
    bac_val.append(balanced_accuracy_score(y_val, y_vali))
    
print('Training BAC (mean over folds) =', np.mean(bac_train))
print('Validation BAC (mean over folds) =', np.mean(bac_val))

In [None]:
le = LabelEncoder().fit(np.unique(y['5-alpha_reductase_inhibitor']))

pred_test = clf2.predict(X_test_n)
# y_pre = pred_test.reshape(-1)
# y_test_n = y_testnp.reshape(-1)

print(f"Accuracy = {accuracy_score(y_prueba,pred_test)}")
print(f"Balanced Accuracy = {balanced_accuracy_score(y_prueba,pred_test)}")

# F1, precision and recall require a convention for which class is the positive one (1)
print(f"F1 = {f1_score(le.transform(y_prueba),le.transform(pred_test))}")

disp = plot_confusion_matrix(clf2, X_test_n, y_prueba, display_labels=np.unique(y['5-alpha_reductase_inhibitor']),
                             cmap=plt.cm.Blues, 
                             normalize='true')
disp.ax_.set_title('Normalized CM - LR(5-alpha_reductase_inhibitor)')
plt.show()

### Label --> `nfkb_inhibitor`

In [None]:
# Keep the score of every fold so the summary is not just the last split
bac_train, bac_val = [], []
for fn, (train_ind, val_ind) in enumerate(cv.split(X_train_n2, y_entren2)):
    print('Starting fold', fn)
    X_tr, X_val = X_train_n2[train_ind], X_train_n2[val_ind]
    y_tr, y_val = y_entren2[train_ind], y_entren2[val_ind]
    clf2.fit(X_tr, y_tr)
    
    y_pred = clf2.predict(X_tr)
    y_vali = clf2.predict(X_val)
    
    bac_train.append(balanced_accuracy_score(y_tr, y_pred))
    bac_val.append(balanced_accuracy_score(y_val, y_vali))
    
print('Training BAC (mean over folds) =', np.mean(bac_train))
print('Validation BAC (mean over folds) =', np.mean(bac_val))

In [None]:
le = LabelEncoder().fit(np.unique(y['nfkb_inhibitor']))

pred_test = clf2.predict(X_test_n2)
# y_pre = pred_test.reshape(-1)
# y_test_n = y_testnp.reshape(-1)

print(f"Accuracy = {accuracy_score(y_prueba2,pred_test)}")
print(f"Balanced Accuracy = {balanced_accuracy_score(y_prueba2,pred_test)}")

# F1, precision and recall require a convention for which class is the positive one (1)
print(f"F1 = {f1_score(le.transform(y_prueba2),le.transform(pred_test))}")

disp = plot_confusion_matrix(clf2, X_test_n2, y_prueba2, display_labels=np.unique(y['nfkb_inhibitor']),
                             cmap=plt.cm.Blues, 
                             normalize='true')
disp.ax_.set_title('Normalized CM - LR(nfkb_inhibitor)')
plt.show()

## Random Forest

### Label --> `5-alpha_reductase_inhibitor`

In [None]:
from sklearn.ensemble import RandomForestClassifier

clf_b = RandomForestClassifier(random_state=0, class_weight='balanced_subsample')

In [None]:
# GridSearchCV is first used here, so it is imported in this cell
from sklearn.model_selection import GridSearchCV

parameters = {'n_estimators':[20,40,60,80,100,120], 'max_depth':[2,4,6,8], 'max_features':[10,20,30,40,50]}

clf_5 = GridSearchCV(estimator=clf_b, param_grid=parameters, cv=cv, scoring='balanced_accuracy', return_train_score=True, verbose=5)
clf_5.fit(X_train_n, y_entren)
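
The grid above contains 6 x 4 x 5 = 120 parameter combinations, each fitted once per CV split. After fitting, the search object exposes the per-combination scores in `cv_results_`. The following sketch shows how they could be inspected, using a much smaller hypothetical grid and synthetic data from `make_classification` so that it runs quickly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

# Synthetic imbalanced problem (about 10% positives).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=20, weights=[0.9, 0.1], random_state=0
)

# Deliberately tiny grid: 2 x 2 = 4 combinations.
grid_toy = {"n_estimators": [10, 20], "max_depth": [2, 4]}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0, class_weight="balanced_subsample"),
    param_grid=grid_toy,
    cv=StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0),
    scoring="balanced_accuracy",
    return_train_score=True,
)
search.fit(X_toy, y_toy)

# One row per parameter combination, best validation score first.
results = (
    pd.DataFrame(search.cv_results_)
    [["params", "mean_train_score", "mean_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
)
print(search.best_params_)
print(results)
```

Comparing `mean_train_score` with `mean_test_score` gives a quick view of overfitting for each combination.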

In [None]:
le = LabelEncoder().fit(np.unique(y['5-alpha_reductase_inhibitor']))

pred_test = clf_5.predict(X_test_n)
# y_pre = pred_test.reshape(-1)
# y_test_n = y_testnp.reshape(-1)

print(f"Accuracy = {accuracy_score(y_prueba,pred_test)}")
print(f"Balanced Accuracy = {balanced_accuracy_score(y_prueba,pred_test)}")

# F1, precision and recall require a convention for which class is the positive one (1)
print(f"F1 = {f1_score(le.transform(y_prueba),le.transform(pred_test))}")

disp = plot_confusion_matrix(clf_5, X_test_n, y_prueba, display_labels=np.unique(y['5-alpha_reductase_inhibitor']),
                             cmap=plt.cm.Blues, 
                             normalize='true')
disp.ax_.set_title('Normalized CM - RF(5-alpha_reductase_inhibitor)')
plt.show()

### Label --> `nfkb_inhibitor`

In [None]:
parameters = {'n_estimators':[20,40,60,80,100,120], 'max_depth':[2,4,6,8], 'max_features':[10,20,30,40,50]}

clf_4 = GridSearchCV(estimator=clf_b, param_grid=parameters, cv=cv, scoring='balanced_accuracy', return_train_score=True, verbose=5)
clf_4.fit(X_train_n2, y_entren2)

In [None]:
le = LabelEncoder().fit(np.unique(y['nfkb_inhibitor']))

pred_test = clf_4.predict(X_test_n2)
# y_pre = pred_test.reshape(-1)
# y_test_n = y_testnp.reshape(-1)

print(f"Accuracy = {accuracy_score(y_prueba2,pred_test)}")
print(f"Balanced Accuracy = {balanced_accuracy_score(y_prueba2,pred_test)}")

# F1, precision and recall require a convention for which class is the positive one (1)
print(f"F1 = {f1_score(le.transform(y_prueba2),le.transform(pred_test))}")

disp = plot_confusion_matrix(clf_4, X_test_n2, y_prueba2, display_labels=np.unique(y['nfkb_inhibitor']),
                             cmap=plt.cm.Blues, 
                             normalize='true')
disp.ax_.set_title('Normalized CM - RF(nfkb_inhibitor)')
plt.show()

## Support vector machines (SVM)

### Label --> `5-alpha_reductase_inhibitor`

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
np.random.seed(4)

# Hyperparameter values to evaluate: kernel coefficient (gamma) and regularization parameter (C)
gamma=[0.01, 0.1, 1]
param_reg = [0.01, 0.1, 1, 10]

svm = SVC(class_weight = 'balanced')

parameters = {'kernel':['linear','poly','rbf'], 'gamma':gamma, 'C':param_reg}

clf_2 = GridSearchCV(estimator=svm, param_grid = parameters, cv=cv, scoring='balanced_accuracy',return_train_score=True, verbose=5)
clf_2.fit(X_train_n, y_entren)
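
Since `gamma` only affects the `poly` and `rbf` kernels, every linear-kernel candidate in the grid above is fitted once per `gamma` value with identical results. One possible refinement, shown here only as a sketch, is to pass the grid as a list of dictionaries; `ParameterGrid` is used to count the candidates in each layout.

```python
from sklearn.model_selection import ParameterGrid

gamma = [0.01, 0.1, 1]
param_reg = [0.01, 0.1, 1, 10]

# Single dictionary: gamma is crossed with every kernel, including 'linear'.
full_grid = {"kernel": ["linear", "poly", "rbf"], "gamma": gamma, "C": param_reg}

# List of dictionaries: gamma is explored only for the kernels that use it.
split_grid = [
    {"kernel": ["linear"], "C": param_reg},
    {"kernel": ["poly", "rbf"], "gamma": gamma, "C": param_reg},
]

n_full = len(ParameterGrid(full_grid))
n_split = len(ParameterGrid(split_grid))
print("candidates with single dict:", n_full)
print("candidates with list of dicts:", n_split)
```

The list form can be passed to `GridSearchCV` as `param_grid` in the same way as the single dictionary.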

In [None]:
print(clf_2.best_params_)
print(clf_2.best_score_)

In [None]:
le = LabelEncoder().fit(np.unique(y['5-alpha_reductase_inhibitor']))

pred_test = clf_2.predict(X_test_n)
# y_pre = pred_test.reshape(-1)
# y_test_n = y_testnp.reshape(-1)

print(f"Accuracy = {accuracy_score(y_prueba,pred_test)}")
print(f"Balanced Accuracy = {balanced_accuracy_score(y_prueba,pred_test)}")

# F1, precision and recall require a convention for which class is the positive one (1)
print(f"F1 = {f1_score(le.transform(y_prueba),le.transform(pred_test))}")

disp = plot_confusion_matrix(clf_2, X_test_n, y_prueba, display_labels=np.unique(y['5-alpha_reductase_inhibitor']),
                             cmap=plt.cm.Blues, 
                             normalize='true')
disp.ax_.set_title('Normalized CM - SVM(5-alpha_reductase_inhibitor)')
plt.show()

### Label --> `nfkb_inhibitor`

In [None]:
gamma=[0.01, 0.1, 1]
param_reg = [0.01, 0.1, 1, 10]

svm = SVC()#class_weight = 'balanced')

parameters = {'kernel':['linear','rbf'], 'gamma':gamma, 'C':param_reg}

clf_3 = GridSearchCV(estimator=svm, param_grid = parameters, cv=cv, scoring='balanced_accuracy',return_train_score=True, verbose=5)
clf_3.fit(X_train_max2, y_entren2)

In [None]:
print(clf_3.best_params_)
print(clf_3.best_score_)

In [None]:
le = LabelEncoder().fit(np.unique(y['nfkb_inhibitor']))

pred_test = clf_3.predict(X_test_max2)
# y_pre = pred_test.reshape(-1)
# y_test_n = y_testnp.reshape(-1)

print(f"Accuracy = {accuracy_score(y_prueba2,pred_test)}")
print(f"Balanced Accuracy = {balanced_accuracy_score(y_prueba2,pred_test)}")

# F1, precision and recall require a convention for which class is the positive one (1)
print(f"F1 = {f1_score(le.transform(y_prueba2),le.transform(pred_test))}")

disp = plot_confusion_matrix(clf_3, X_test_max2, y_prueba2, display_labels=np.unique(y['nfkb_inhibitor']),
                             cmap=plt.cm.Blues, 
                             normalize='true')
disp.ax_.set_title('Normalized CM - SVM(nfkb_inhibitor)')
plt.show()

As the confusion matrices show, despite the label imbalance, 874 activations were enough to obtain a meaningful result for the output variable with the most activations. However, with only a small number of activations, the models were unable to classify the positive activations correctly.

The next objectives are:
* Test classification models that allow balanced weights to be assigned to the classes.
* These models will be used for the labels with the most activations.
* Try SMOTE to create synthetic data and give more weight to the class with fewer samples.
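
As a rough illustration of the idea behind SMOTE, the sketch below creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. It is a simplified, hypothetical helper (`smote_like_oversample` is not a library function); in practice the `SMOTE` class from the imbalanced-learn package would normally be used, and oversampling would be applied to the training data only.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Simplified SMOTE-style sketch: build n_new synthetic minority samples
    by interpolating toward one of the k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    k = min(k, len(X_min) - 1)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neigh_idx = nn.kneighbors(X_min)  # column 0 is normally the point itself
    base = rng.integers(0, len(X_min), size=n_new)            # starting points
    neigh = neigh_idx[base, rng.integers(1, k + 1, size=n_new)]  # chosen neighbours
    gap = rng.random((n_new, 1))                               # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

# Synthetic example: 95 majority samples and 5 minority samples.
rng = np.random.default_rng(1)
X_majority = rng.normal(0, 1, size=(95, 3))
X_minority = rng.normal(3, 1, size=(5, 3))

# Generate 90 synthetic minority samples so both classes have 95 samples.
X_synth = smote_like_oversample(X_minority, n_new=90)
X_bal = np.vstack([X_majority, X_minority, X_synth])
y_bal = np.r_[np.zeros(95, dtype=int), np.ones(5 + 90, dtype=int)]

print("class counts after oversampling:", np.bincount(y_bal))
```

Because each synthetic point lies on a segment between two real minority samples, the new data stays inside the region already covered by the minority class.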