# Repeated k-folds (with augmented data)
This notebook implements repeated k-fold cross-validation. Repeated k-fold reruns k-fold several times, with a different random partition on each repeat, so it produces multiple train/test splits of the same data.
The data is a combination of synthetic and real samples.
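
To make the idea concrete, here is a minimal sketch on a toy array. The six-sample array and the seed are assumptions for illustration, not the notebook's data. With `n_splits=3` and `n_repeats=2`, `RepeatedKFold` yields 3 × 2 = 6 splits, and every sample lands in a test fold exactly once per repeat.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

# Toy data: six samples, one feature (illustrative only)
X_toy = np.arange(6).reshape(-1, 1)

rkf = RepeatedKFold(n_splits=3, n_repeats=2, random_state=0)

# Count how often each sample ends up in a test fold
test_counts = np.zeros(len(X_toy), dtype=int)
splits = list(rkf.split(X_toy))
for i, (train_idx, test_idx) in enumerate(splits, start=1):
    test_counts[test_idx] += 1
    print(f"split {i}: train={train_idx.tolist()} test={test_idx.tolist()}")

print("number of splits:", len(splits))
print("times each sample was tested:", test_counts.tolist())
```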

**Step 1**: Load the phenotypic data into a pandas DataFrame and build a per-sample list of SITE_ID labels, appending a 'SYNTH' label for each of the 200 synthetic samples.

In [50]:
#Import modules for this step
import os

import pandas as pd
from nilearn import datasets

#Fetch the ABIDE data with nilearn.datasets.fetch_abide_pcp
abide = datasets.fetch_abide_pcp(data_dir=os.path.join(os.sep, "Users", "htamvada", "nai"),
                                 pipeline="cpac",
                                 quality_checked=True)

#Load phenotypic data into a pandas DataFrame
abide_pheno = pd.DataFrame(abide.phenotypic)

#Build one site label per real sample (SITE_ID values are bytes, so decode them)
groups = [s.decode() for s in abide_pheno.SITE_ID]

#Label the 200 synthetic samples with a placeholder site name
groups.extend(['SYNTH'] * 200)



**Step 2**: Define the dataset split using built-in scikit-learn methods. This notebook uses sklearn.model_selection.RepeatedKFold with 10 splits and 2 repeats.

In [51]:
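#Import modules
import os

import numpy as np
from sklearn.model_selection import RepeatedKFold

import prepare_data

#Define data and output directories
data_dir = os.path.join(os.sep, "Users", "htamvada", "nai")
output_dir = data_dir

#Load the stacked synthetic + real features and their labels
X, y = prepare_data.fetch_generated_data(data_dir, output_dir)

#10 folds repeated twice gives 20 train/test splits; the fixed seed makes them reproducible
#(the splitter is stored as logo because the later cells refer to it by that name)
logo = RepeatedKFold(n_splits=10, n_repeats=2, random_state=2652124)
logo.get_n_splits(X, y)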

Loading dataset...
Stacked synthetic and original features  (1071, 2016)
Running PCA...


20
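
The value 20 is `n_splits` × `n_repeats` (10 × 2). As a rough check of what each split looks like, the sketch below counts test-fold sizes for 1071 samples. `X_placeholder` is a stand-in array introduced here, since the real features are not needed to count fold sizes. Because 1071 is not divisible by 10, one test fold per repeat holds 108 samples and the other nine hold 107.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

# Placeholder with the same number of rows as the notebook's X (values are irrelevant here)
X_placeholder = np.zeros((1071, 1))

cv = RepeatedKFold(n_splits=10, n_repeats=2, random_state=2652124)
test_sizes = [len(test_idx) for _, test_idx in cv.split(X_placeholder)]

print("number of splits:", len(test_sizes))
print("folds with 108 test samples:", test_sizes.count(108))
print("folds with 107 test samples:", test_sizes.count(107))
```

These sizes are consistent with the accuracy values reported further down: for example, 0.7129... equals 77/108 and 0.7009... equals 75/107.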

In [52]:
X.shape

(1071, 686)

In [53]:
len(groups)

1071

In [54]:
len(y)

1071
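
All three lengths agree, so `groups` lines up one-to-one with the rows of `X` and `y`. Note that `RepeatedKFold` partitions the samples at random and does not take the site labels into account. If site-aware splits were wanted, one option (not what this notebook does) is `GroupKFold`, which keeps each group entirely on one side of every split. A minimal sketch with made-up site labels:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical toy data: 8 samples from 4 made-up sites
X_toy = np.arange(16).reshape(8, 2)
y_toy = np.array([0, 1, 0, 1, 0, 1, 0, 1])
groups_toy = np.array(["A", "A", "B", "B", "C", "C", "SYNTH", "SYNTH"])

gkf = GroupKFold(n_splits=4)

# For each split, record the sites that appear in both train and test (should be none)
overlaps = []
for train_idx, test_idx in gkf.split(X_toy, y_toy, groups_toy):
    train_sites = set(groups_toy[train_idx])
    test_sites = set(groups_toy[test_idx])
    overlaps.append(train_sites & test_sites)
    print("test sites:", sorted(test_sites))
```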

**Step 3:** Choose which machine learning classifier to use. We will try four different classifiers in this notebook.
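
One compact way to compare several classifiers on the same splits is `cross_val_score`, which runs the fit-and-score loop internally. The sketch below uses synthetic data from `make_classification` as a stand-in for `X` and `y`, with fewer splits to keep it quick. The dataset size, the seeds, and the `n_estimators` value are assumptions, and the scores it prints say nothing about the ABIDE results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

# A fixed integer seed makes every classifier see the same 10 splits
cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=0)

classifiers = {
    "LinearSVC": LinearSVC(max_iter=10000),
    "KNeighbors": KNeighborsClassifier(),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=20, random_state=0),
}

results = {}
for name, clf in classifiers.items():
    # Default scoring for classifiers is accuracy
    scores = cross_val_score(clf, X_demo, y_demo, cv=cv)
    results[name] = scores
    print(f"{name}: mean={scores.mean():.3f} min={scores.min():.3f} max={scores.max():.3f}")
```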

**Step 3.1:** Support Vector Machines (SVM) - LinearSVC

In [55]:
from sklearn.svm import LinearSVC
import statistics
print("----------------------------------------------------")
print("RepeatedKFold with Linear Support Vector Classification")
print("----------------------------------------------------")

l_svc = LinearSVC(max_iter=10000)

#Fit on each of the 20 train/test splits and record the test accuracy
accuracy = []
for count, (train_index, test_index) in enumerate(logo.split(X, y), start=1):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print("Training model ", count)
    l_svc.fit(X_train, y_train)
    accuracy.append(l_svc.score(X_test, y_test))

print("Finished training.\n")

#Test-set accuracy (fraction of correct predictions) for each model
for index, a in enumerate(accuracy, start=1):
    print("Accuracy score for model", index, " ", a)

#Report the mean, maximum, and minimum accuracy across all models
print("\nAverage accuracy score for all models: ", statistics.mean(accuracy))
print("Maximum accuracy score of all models: ", max(accuracy))
print("Minimum accuracy score of all models: ", min(accuracy))

----------------------------------------------------
RepeatedKFold with Linear Support Vector Classification
----------------------------------------------------
Training model  1
Training model  2
Training model  3
Training model  4
Training model  5
Training model  6
Training model  7
Training model  8
Training model  9
Training model  10
Training model  11
Training model  12
Training model  13
Training model  14
Training model  15
Training model  16
Training model  17
Training model  18
Training model  19
Training model  20
Finished training.

Accuracy score for model 1   0.7129629629629629
Accuracy score for model 2   0.7009345794392523
Accuracy score for model 3   0.7570093457943925
Accuracy score for model 4   0.7289719626168224
Accuracy score for model 5   0.7289719626168224
Accuracy score for model 6   0.7009345794392523
Accuracy score for model 7   0.6635514018691588
Accuracy score for model 8   0.6355140186915887
Accuracy score for model 9   0.6542056074766355
Accuracy scor
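
The per-fold scores can also be summarized with a spread, not only the mean. The helper below, `summarize_scores`, is a hypothetical name introduced for illustration. It is applied to the nine LinearSVC scores visible above, so the numbers it prints cover only part of the 20-split run.

```python
import statistics


def summarize_scores(scores):
    """Return the mean, sample standard deviation, min, and max of fold accuracies."""
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }


# The nine LinearSVC fold accuracies shown in the output above (the rest is cut off)
visible_scores = [
    0.7129629629629629,
    0.7009345794392523,
    0.7570093457943925,
    0.7289719626168224,
    0.7289719626168224,
    0.7009345794392523,
    0.6635514018691588,
    0.6355140186915887,
    0.6542056074766355,
]

summary = summarize_scores(visible_scores)
for key, value in summary.items():
    print(f"{key}: {value:.4f}")
```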

**Step 3.2:** k-Nearest Neighbors - KNeighborsClassifier

In [56]:
from sklearn.neighbors import KNeighborsClassifier
import statistics
print("--------------------------------------------------")
print("RepeatedKFold with K-Nearest Neighbors Classification")
print("--------------------------------------------------")

knn = KNeighborsClassifier()

#RepeatedKFold does not use group labels, so groups is not passed to split()
accuracy = []
for count, (train_index, test_index) in enumerate(logo.split(X, y), start=1):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print("Training model ", count)
    knn.fit(X_train, y_train)
    accuracy.append(knn.score(X_test, y_test))

print("Finished training.\n")

#Test-set accuracy (fraction of correct predictions) for each model
for index, a in enumerate(accuracy, start=1):
    print("Accuracy score for model", index, " ", a)

#Report the mean, maximum, and minimum accuracy across all models
print("\nAverage accuracy score for all models: ", statistics.mean(accuracy))
print("Maximum accuracy score of all models: ", max(accuracy))
print("Minimum accuracy score of all models: ", min(accuracy))

--------------------------------------------------
RepeatedKFold with K-Nearest Neighbors Classification
--------------------------------------------------
Training model  1
Training model  2
Training model  3
Training model  4
Training model  5
Training model  6
Training model  7
Training model  8
Training model  9
Training model  10
Training model  11
Training model  12
Training model  13
Training model  14
Training model  15
Training model  16
Training model  17
Training model  18
Training model  19
Training model  20
Finished training.

Accuracy score for model 1   0.49074074074074076
Accuracy score for model 2   0.4672897196261682
Accuracy score for model 3   0.5327102803738317
Accuracy score for model 4   0.48598130841121495
Accuracy score for model 5   0.45794392523364486
Accuracy score for model 6   0.6074766355140186
Accuracy score for model 7   0.4392523364485981
Accuracy score for model 8   0.5887850467289719
Accuracy score for model 9   0.5607476635514018
Accuracy score for mo

**Step 3.3:** Decision Tree - DecisionTreeClassifier

In [57]:
from sklearn.tree import DecisionTreeClassifier
import statistics
print("--------------------------------------------")
print("RepeatedKFold with Decision Tree Classification")
print("--------------------------------------------")

dt = DecisionTreeClassifier()

#RepeatedKFold does not use group labels, so groups is not passed to split()
accuracy = []
for count, (train_index, test_index) in enumerate(logo.split(X, y), start=1):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print("Training model ", count)
    dt.fit(X_train, y_train)
    accuracy.append(dt.score(X_test, y_test))

print("Finished training.\n")

#Test-set accuracy (fraction of correct predictions) for each model
for index, a in enumerate(accuracy, start=1):
    print("Accuracy score for model", index, " ", a)

#Report the mean, maximum, and minimum accuracy across all models
print("\nAverage accuracy score for all models: ", statistics.mean(accuracy))
print("Maximum accuracy score of all models: ", max(accuracy))
print("Minimum accuracy score of all models: ", min(accuracy))

--------------------------------------------
RepeatedKFold with Decision Tree Classification
--------------------------------------------
Training model  1
Training model  2
Training model  3
Training model  4
Training model  5
Training model  6
Training model  7
Training model  8
Training model  9
Training model  10
Training model  11
Training model  12
Training model  13
Training model  14
Training model  15
Training model  16
Training model  17
Training model  18
Training model  19
Training model  20
Finished training.

Accuracy score for model 1   0.6018518518518519
Accuracy score for model 2   0.6074766355140186
Accuracy score for model 3   0.6074766355140186
Accuracy score for model 4   0.6261682242990654
Accuracy score for model 5   0.5607476635514018
Accuracy score for model 6   0.6261682242990654
Accuracy score for model 7   0.5981308411214953
Accuracy score for model 8   0.616822429906542
Accuracy score for model 9   0.5794392523364486
Accuracy score for model 10   0.66355140186

**Step 3.4:** Random Forests - RandomForestClassifier

In [58]:
from sklearn.ensemble import RandomForestClassifier
import statistics
print("--------------------------------------------")
print("RepeatedKFold with Random Forest Classification")
print("--------------------------------------------")

rf = RandomForestClassifier()

#RepeatedKFold does not use group labels, so groups is not passed to split()
accuracy = []
for count, (train_index, test_index) in enumerate(logo.split(X, y), start=1):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print("Training model ", count)
    rf.fit(X_train, y_train)
    accuracy.append(rf.score(X_test, y_test))

print("Finished training.\n")

#Test-set accuracy (fraction of correct predictions) for each model
for index, a in enumerate(accuracy, start=1):
    print("Accuracy score for model", index, " ", a)

#Report the mean, maximum, and minimum accuracy across all models
print("\nAverage accuracy score for all models: ", statistics.mean(accuracy))
print("Maximum accuracy score of all models: ", max(accuracy))
print("Minimum accuracy score of all models: ", min(accuracy))

--------------------------------------------
RepeatedKFold with Random Forest Classification
--------------------------------------------
Training model  1
Training model  2
Training model  3
Training model  4
Training model  5
Training model  6
Training model  7
Training model  8
Training model  9
Training model  10
Training model  11
Training model  12
Training model  13
Training model  14
Training model  15
Training model  16
Training model  17
Training model  18
Training model  19
Training model  20
Finished training.

Accuracy score for model 1   0.6111111111111112
Accuracy score for model 2   0.5794392523364486
Accuracy score for model 3   0.6542056074766355
Accuracy score for model 4   0.5981308411214953
Accuracy score for model 5   0.6542056074766355
Accuracy score for model 6   0.514018691588785
Accuracy score for model 7   0.6074766355140186
Accuracy score for model 8   0.6635514018691588
Accuracy score for model 9   0.6915887850467289
Accuracy score for model 10   0.67289719626