# Classifier Selection 

Comparing several dimensionality reduction techniques on the full dataset suggests that a linear classifier might be able to separate these classes. I'm going to give two basic classifiers a shot on both the full data and the DMN subset. However, since I have only $200$ samples, I'll first need to apply some unsupervised dimensionality reduction to the full set of autocorrelation values.
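As a rough illustration of that idea, here is a minimal sketch that puts scaling, PCA, and a linear SVC into a single scikit-learn `Pipeline` and scores it with stratified cross-validation. Everything in it is an assumption made for illustration: the data are synthetic (200 samples, 400 features), and the names `X_demo`, `y_demo`, `pca_svc`, and the choice of 20 components are mine, not values taken from this project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 200 samples, 400 features, two balanced classes.
rng = np.random.default_rng(0)
n_samples, n_features = 200, 400
y_demo = np.repeat([0, 1], n_samples // 2)
X_demo = rng.normal(size=(n_samples, n_features))
X_demo[y_demo == 1, :20] += 0.5  # shift a few features so the classes differ

# Scaling and PCA sit inside the pipeline, so each CV fold fits them
# on its own training portion only.
pca_svc = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=20)),  # 20 components is an arbitrary illustrative choice
    ("svc", SVC(kernel="linear", C=1.0)),
])

cv_demo = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
cv_scores = cross_val_score(pca_svc, X_demo, y_demo, cv=cv_demo)
print(f"mean CV accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
```

Because PCA never sees the labels, it can be fit on far more features than samples without overfitting to the classes, which is the reason for reaching for an unsupervised reduction here.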

In [1]:

import sys
sys.path.append('..') #workaround to deal with directory issues in notebooks

import numpy as np
import pandas as pd 

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis 
from sklearn.decomposition import PCA, KernelPCA
from sklearn.manifold import TSNE
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import (KNeighborsClassifier,
                               NeighborhoodComponentsAnalysis)

from xgboost import XGBClassifier

from src.models import train_model 
from src.features import load_features


In [2]:
data_dir = '../data/'
(class_labels, two_class_labels, pos_str, neg_str,
 clus_co, ar_array, num_regions, num_subjs) = load_features(data_dir)

In [9]:
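# Stack the four feature blocks along axis 0, then transpose so that rows are
# samples and columns are features, which is the layout scikit-learn expects.
x = np.concatenate((ar_array, pos_str, neg_str, clus_co), axis=0).transpose()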


# Hold out 20% of the samples as a stratified test set; model selection uses only the training split.
x_train, x_test, y_train, y_test = train_test_split(
    x, two_class_labels, test_size=0.20, random_state=42, stratify=two_class_labels)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Fit the scaler on the training split only, then apply the same transform to the test split.
scale = StandardScaler()
x_train_sc = scale.fit_transform(x_train)
x_test_sc = scale.transform(x_test)

#pipelines=[pipe_svc,pipe_XGB,pipe_pca_svc,pipe_pca_XGB,pipe_MI_svc,pipe_MI_XGB]

pipelines = ["svc_0.0", "xgb_0.0"]  # current models, with (crude) version tags

# Column indices of each feature block in x (one block of num_regions columns per feature type).
subsets = {
    "Auto": np.arange(num_regions),
    "Pos": np.arange(num_regions, 2 * num_regions),
    "Neg": np.arange(2 * num_regions, 3 * num_regions),
    "Clus": np.arange(3 * num_regions, 4 * num_regions),
    "All": np.arange(4 * num_regions),
    "AutoPos": np.arange(2 * num_regions),
}



In [10]:
fit_models,score,estimator=train_model.train_multi_subset_pipeline(x_train_sc,y_train,cv,subsets,pipelines,save_flag=True)
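The source of `train_model.train_multi_subset_pipeline` is not shown in this notebook, so the following is only a sketch of the general pattern I assume it follows: loop over every feature subset and every estimator, select the subset's columns, and cross-validate. The function `cross_validate_subsets` and all `*_demo` names are hypothetical, the data are synthetic, and scikit-learn's `GradientBoostingClassifier` is swapped in for XGBoost to keep the example free of extra dependencies.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC


def cross_validate_subsets(X, y, cv, subsets, estimators):
    """Hypothetical helper: CV accuracy for every (subset, estimator) pair.

    Returns a dict mapping "<subset>_<estimator>" to (mean, variance)
    of the per-fold scores.
    """
    results = {}
    for subset_name, cols in subsets.items():
        for est_name, est in estimators.items():
            # cross_val_score clones the estimator, so reusing it is safe.
            scores = cross_val_score(est, X[:, cols], y, cv=cv)
            results[f"{subset_name}_{est_name}"] = (scores.mean(), scores.var())
    return results


# Synthetic stand-in data (assumption): 60 samples, 4 blocks of 5 features.
rng = np.random.default_rng(0)
n_regions_demo = 5
X_demo = rng.normal(size=(60, 4 * n_regions_demo))
y_demo = np.repeat([0, 1], 30)

subsets_demo = {
    "Auto": np.arange(n_regions_demo),
    "All": np.arange(4 * n_regions_demo),
}
estimators_demo = {
    "svc": SVC(kernel="linear"),
    "gb": GradientBoostingClassifier(n_estimators=20, random_state=0),
}
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = cross_validate_subsets(X_demo, y_demo, cv_demo, subsets_demo, estimators_demo)
for name, (mean_score, var_score) in results.items():
    print(f"{name:10s} mean={mean_score:.3f} var={var_score:.4f}")
```

Keying the results by `"<subset>_<estimator>"` mirrors the naming that shows up in the printed performance table, which makes the two easy to line up.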

In [11]:
performance=train_model.score_on_test(x_test_sc,y_test,fit_models,subsets)

In [12]:
print(performance)

           Pipeline     Score  Variance
0      Auto_svc_0.0  0.539886  0.003532
1      Auto_xgb_0.0  0.445869  0.003032
2       Pos_svc_0.0  0.616809  0.005355
3       Pos_xgb_0.0  0.559829  0.004868
4       Neg_svc_0.0  0.467236  0.005936
5       Neg_xgb_0.0  0.599715  0.006677
6      Clus_svc_0.0  0.698006  0.006717
7      Clus_xgb_0.0  0.465812  0.004547
8       All_svc_0.0  0.559829  0.004868
9       All_xgb_0.0  0.502849  0.004093
10  AutoPos_svc_0.0  0.501425  0.002230
11  AutoPos_xgb_0.0  0.445869  0.003032
