# Repeated k-folds
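This notebook implements repeated k-fold cross-validation. RepeatedKFold runs k-fold cross-validation several times, reshuffling the data before each repetition, so it produces several different train/test splits of the same dataset.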
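
As a minimal sketch of the idea, the toy example below (six made-up samples, not the ABIDE data) shows that RepeatedKFold yields n_splits × n_repeats train/test splits. The names X_toy and rkf are illustrative assumptions and are not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

# Toy data: 6 samples with 1 feature each (illustrative only)
X_toy = np.arange(6).reshape(-1, 1)

# 3 folds repeated 2 times, reshuffled before each repetition
rkf = RepeatedKFold(n_splits=3, n_repeats=2, random_state=0)

splits = list(rkf.split(X_toy))
for i, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"split {i}: train={train_idx} test={test_idx}")

# Total number of splits is n_splits * n_repeats
n_total = rkf.get_n_splits()
print("Total splits:", n_total)
```

Each split holds out two of the six samples, and every sample is held out once per repetition.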

**Step 1**: Load the dataset's phenotypic data into a pandas DataFrame and collect the SITE_ID of every subject.

In [4]:
#Import modules for this step
from nilearn import datasets
import pandas as pd

#Fetch data using nilearn.datasets.fetch_abide_pcp
abide = datasets.fetch_abide_pcp(data_dir="/home/ubuntu/nai",
                                 pipeline="cpac",
                                 quality_checked=True)

#Load phenotypic data into pandas dataframe
abide_pheno = pd.DataFrame(abide.phenotypic)

#Collect the site label of every subject (one entry per subject, not deduplicated).
#Each SITE_ID value is decoded from bytes to str.
groups = [s.decode() for s in abide_pheno.SITE_ID]


**Step 2**: Define the train/test splits with scikit-learn's built-in sklearn.model_selection.RepeatedKFold. With n_splits=10 and n_repeats=2, it yields 20 splits in total.

In [5]:
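#Import modules 
import numpy as np 
from sklearn.model_selection import RepeatedKFold
import prepare_data

#Define data and output directories 
data_dir = "/home/ubuntu/nai"
output_dir = data_dir

#Load the feature matrix X and labels y (prepare_data comes from a module not shown here)
X, y = prepare_data.prepare_data(data_dir,output_dir)

#10 folds repeated twice gives 20 train/test splits.
#Note: despite its name, logo holds a RepeatedKFold splitter, not LeaveOneGroupOut.
logo = RepeatedKFold(n_splits=10, n_repeats=2, random_state=2652124)
logo.get_n_splits(X, y)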

Loading dataset...
Feature file found.
Running PCA...


20
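
The value 20 is n_splits × n_repeats (10 × 2). The sketch below checks the fold sizes and confirms that, within each repetition, the test folds cover every sample exactly once. The sample count of 871 is an assumption used purely for illustration; it is consistent with the test-fold denominators 88 and 87 implied by the accuracy fractions printed in Step 3. X_demo and rkf are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

n_samples = 871  # assumed sample count, for illustration only
X_demo = np.zeros((n_samples, 1))  # placeholder features; only the length matters

rkf = RepeatedKFold(n_splits=10, n_repeats=2, random_state=2652124)
test_folds = [test_idx for _, test_idx in rkf.split(X_demo)]

# KFold gives the first (n_samples % n_splits) folds one extra sample
fold_sizes = [len(t) for t in test_folds]
print("Test fold sizes:", fold_sizes)

# The first 10 splits belong to repetition 1, the last 10 to repetition 2
first_repeat = np.sort(np.concatenate(test_folds[:10]))
second_repeat = np.sort(np.concatenate(test_folds[10:]))
covers_all = (np.array_equal(first_repeat, np.arange(n_samples))
              and np.array_equal(second_repeat, np.arange(n_samples)))
print("Each repetition tests every sample once:", covers_all)
```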

**Step 3:** Choose the machine learning classifiers to evaluate. This notebook tries four different classifiers.

**Step 3.1:** Support Vector Machines (SVM) - LinearSVC

In [6]:
from sklearn.svm import LinearSVC
import statistics
print("----------------------------------------------------")
print("RepeatedKFold with Linear Support Vector Classification")
print("----------------------------------------------------")

l_svc = LinearSVC(max_iter=10000)

accuracy = []
count = 0
for train_index, test_index in logo.split(X,y): 
    count += 1
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print("Training model ",count)
    l_svc.fit(X_train,y_train)
    acc_score = l_svc.score(X_test, y_test)
    accuracy.append(acc_score)

print("Finished training.\n")

#Print the test accuracy of each model (score returns the mean accuracy on the held-out fold)
for index, a in enumerate(accuracy, start=1):
    print("Accuracy score for model", index, " ", a)

#Report the average accuracy for all models 
print("\nAverage accuracy score for all models: ", statistics.mean(accuracy))
print("Maximum accuracy score of all models: ", max(accuracy))
print("Minimum accuracy score of all models: ", min(accuracy))

----------------------------------------------------
RepeatedKFold with Linear Support Vector Classification
----------------------------------------------------
Training model  1
Training model  2
Training model  3
Training model  4
Training model  5
Training model  6
Training model  7
Training model  8
Training model  9
Training model  10
Training model  11
Training model  12
Training model  13
Training model  14
Training model  15
Training model  16
Training model  17
Training model  18
Training model  19
Training model  20
Finished training.

Accuracy score for model 1   0.7045454545454546
Accuracy score for model 2   0.632183908045977
Accuracy score for model 3   0.6666666666666666
Accuracy score for model 4   0.5862068965517241
Accuracy score for model 5   0.6896551724137931
Accuracy score for model 6   0.7241379310344828
Accuracy score for model 7   0.5172413793103449
Accuracy score for model 8   0.632183908045977
Accuracy score for model 9   0.6666666666666666
Accuracy score fo
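
The manual loop above can also be expressed with sklearn.model_selection.cross_val_score, which fits a fresh copy of the estimator on every split and returns one score per split. The sketch below is a hedged illustration on synthetic data from make_classification, not on the ABIDE features; X_demo, y_demo, cv, and clf are hypothetical names.

```python
import statistics

from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in for the real feature matrix and labels (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

cv = RepeatedKFold(n_splits=10, n_repeats=2, random_state=2652124)
clf = LinearSVC(max_iter=10000)

# One accuracy score per train/test split (20 in total)
scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="accuracy")

print("Number of scores:", len(scores))
print("Average accuracy:", statistics.mean(scores))
print("Maximum accuracy:", max(scores))
print("Minimum accuracy:", min(scores))
```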

**Step 3.2:** k-Nearest Neighbors - KNeighborsClassifier

In [7]:
from sklearn.neighbors import KNeighborsClassifier
import statistics
print("--------------------------------------------------")
print("RepeatedKFold with K-Nearest Neighbors Classification")
print("--------------------------------------------------")

knn = KNeighborsClassifier()

accuracy = []
count = 0
for train_index, test_index in logo.split(X,y): #groups is not passed because RepeatedKFold ignores it
    count += 1
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print("Training model ",count)
    knn.fit(X_train,y_train)
    acc_score = knn.score(X_test, y_test)
    accuracy.append(acc_score)

print("Finished training.\n")

#Print the test accuracy of each model (score returns the mean accuracy on the held-out fold)
for index, a in enumerate(accuracy, start=1):
    print("Accuracy score for model", index, " ", a)

#Report the average accuracy for all models 
print("\nAverage accuracy score for all models: ", statistics.mean(accuracy))
print("Maximum accuracy score of all models: ", max(accuracy))
print("Minimum accuracy score of all models: ", min(accuracy))

--------------------------------------------------
RepeatedKFold with K-Nearest Neighbors Classification
--------------------------------------------------
Training model  1
Training model  2
Training model  3
Training model  4
Training model  5
Training model  6
Training model  7
Training model  8
Training model  9
Training model  10
Training model  11
Training model  12
Training model  13
Training model  14
Training model  15
Training model  16
Training model  17
Training model  18
Training model  19
Training model  20
Finished training.

Accuracy score for model 1   0.5795454545454546
Accuracy score for model 2   0.5287356321839081
Accuracy score for model 3   0.5402298850574713
Accuracy score for model 4   0.5862068965517241
Accuracy score for model 5   0.5172413793103449
Accuracy score for model 6   0.5747126436781609
Accuracy score for model 7   0.4942528735632184
Accuracy score for model 8   0.5862068965517241
Accuracy score for model 9   0.6206896551724138
Accuracy score for mo
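
RepeatedKFold ignores any groups argument, so the site labels collected in Step 1 do not influence these splits. The sketch below illustrates this on toy data and contrasts it with LeaveOneGroupOut, a scikit-learn splitter that does hold out one group at a time. The site labels "A", "B", and "C" and all variable names here are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, RepeatedKFold

X_demo = np.arange(12).reshape(-1, 1)
y_demo = np.array([0, 1] * 6)
groups_demo = np.array(["A"] * 4 + ["B"] * 4 + ["C"] * 4)  # hypothetical site labels

# With a fixed integer seed, RepeatedKFold yields the same splits with or without groups
rkf = RepeatedKFold(n_splits=3, n_repeats=2, random_state=0)
without_groups = [test for _, test in rkf.split(X_demo, y_demo)]
with_groups = [test for _, test in rkf.split(X_demo, y_demo, groups_demo)]
same_splits = all(np.array_equal(a, b) for a, b in zip(without_groups, with_groups))
print("Splits identical with and without groups:", same_splits)

# A group-aware splitter holds out exactly one site per split
logo_demo = LeaveOneGroupOut()
held_out_sites = [set(groups_demo[test])
                  for _, test in logo_demo.split(X_demo, y_demo, groups_demo)]
print("Held-out sites per split:", held_out_sites)
```

If the goal is to measure generalization to unseen sites, a group-aware splitter would be the relevant choice; with RepeatedKFold, subjects from the same site can appear in both the training and test sets.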

**Step 3.3:** Decision Tree - DecisionTreeClassifier

In [8]:
from sklearn.tree import DecisionTreeClassifier
import statistics
print("--------------------------------------------")
print("RepeatedKFold with Decision Tree Classification")
print("--------------------------------------------")

dt = DecisionTreeClassifier()

accuracy = []
count = 0
for train_index, test_index in logo.split(X,y): #groups is not passed because RepeatedKFold ignores it
    count += 1
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print("Training model ",count)
    dt.fit(X_train,y_train)
    acc_score = dt.score(X_test, y_test)
    accuracy.append(acc_score)

print("Finished training.\n")

#Print the test accuracy of each model (score returns the mean accuracy on the held-out fold)
for index, a in enumerate(accuracy, start=1):
    print("Accuracy score for model", index, " ", a)

#Report the average accuracy for all models 
print("\nAverage accuracy score for all models: ", statistics.mean(accuracy))
print("Maximum accuracy score of all models: ", max(accuracy))
print("Minimum accuracy score of all models: ", min(accuracy))

--------------------------------------------
RepeatedKFold with Decision Tree Classification
--------------------------------------------
Training model  1
Training model  2
Training model  3
Training model  4
Training model  5
Training model  6
Training model  7
Training model  8
Training model  9
Training model  10
Training model  11
Training model  12
Training model  13
Training model  14
Training model  15
Training model  16
Training model  17
Training model  18
Training model  19
Training model  20
Finished training.

Accuracy score for model 1   0.48863636363636365
Accuracy score for model 2   0.4482758620689655
Accuracy score for model 3   0.4482758620689655
Accuracy score for model 4   0.5402298850574713
Accuracy score for model 5   0.4482758620689655
Accuracy score for model 6   0.5517241379310345
Accuracy score for model 7   0.45977011494252873
Accuracy score for model 8   0.4367816091954023
Accuracy score for model 9   0.5632183908045977
Accuracy score for model 10   0.54022
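
DecisionTreeClassifier() is created here without a random_state, so the fitted tree, and therefore the scores, may differ between runs even though the splits are fixed. The sketch below, on synthetic data with hypothetical variable names, shows that fixing random_state makes two independent fits agree; the seed 42 is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test = X_demo[:150], X_demo[150:]
y_train = y_demo[:150]

# Two separate trees with the same seed are fitted identically
tree_a = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
tree_b = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

identical = np.array_equal(tree_a.predict(X_test), tree_b.predict(X_test))
print("Identical predictions with a fixed seed:", identical)
```

The same argument applies to RandomForestClassifier, which also accepts random_state.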

**Step 3.4:** Random Forests - RandomForestClassifier

In [9]:
from sklearn.ensemble import RandomForestClassifier
import statistics
print("--------------------------------------------")
print("RepeatedKFold with Random Forest Classification")
print("--------------------------------------------")

rf = RandomForestClassifier()

accuracy = []
count = 0
for train_index, test_index in logo.split(X,y): #groups is not passed because RepeatedKFold ignores it
    count += 1
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print("Training model ",count)
    rf.fit(X_train,y_train)
    acc_score = rf.score(X_test, y_test)
    accuracy.append(acc_score)

print("Finished training.\n")

#Print the test accuracy of each model (score returns the mean accuracy on the held-out fold)
for index, a in enumerate(accuracy, start=1):
    print("Accuracy score for model", index, " ", a)

#Report the average accuracy for all models 
print("\nAverage accuracy score for all models: ", statistics.mean(accuracy))
print("Maximum accuracy score of all models: ", max(accuracy))
print("Minimum accuracy score of all models: ", min(accuracy))

--------------------------------------------
RepeatedKFold with Random Forest Classification
--------------------------------------------
Training model  1
Training model  2
Training model  3
Training model  4
Training model  5
Training model  6
Training model  7
Training model  8
Training model  9
Training model  10
Training model  11
Training model  12
Training model  13
Training model  14
Training model  15
Training model  16
Training model  17
Training model  18
Training model  19
Training model  20
Finished training.

Accuracy score for model 1   0.5909090909090909
Accuracy score for model 2   0.4942528735632184
Accuracy score for model 3   0.5862068965517241
Accuracy score for model 4   0.5632183908045977
Accuracy score for model 5   0.5172413793103449
Accuracy score for model 6   0.5172413793103449
Accuracy score for model 7   0.45977011494252873
Accuracy score for model 8   0.632183908045977
Accuracy score for model 9   0.5977011494252874
Accuracy score for model 10   0.5977011
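
The four cells in Step 3 repeat the same training loop. One possible way to factor it out and compare the classifiers side by side is sketched below on synthetic data. The evaluate helper, the models dictionary, and the fixed seeds are assumptions introduced for illustration. The standard deviation is reported alongside the mean because fold-to-fold variation is large relative to the differences between classifiers.

```python
import statistics

from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier


def evaluate(model, X, y, cv):
    """Hypothetical helper: return one test accuracy per split of cv."""
    scores = []
    for train_idx, test_idx in cv.split(X, y):
        fold_model = clone(model)  # fresh, unfitted copy for every split
        fold_model.fit(X[train_idx], y[train_idx])
        scores.append(fold_model.score(X[test_idx], y[test_idx]))
    return scores


# Synthetic stand-in for the real feature matrix and labels (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)
cv = RepeatedKFold(n_splits=10, n_repeats=2, random_state=2652124)

models = {
    "LinearSVC": LinearSVC(max_iter=10000),
    "KNeighbors": KNeighborsClassifier(),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    scores = evaluate(model, X_demo, y_demo, cv)
    results[name] = (statistics.mean(scores), statistics.stdev(scores))
    print(f"{name:<13} mean={results[name][0]:.3f} stdev={results[name][1]:.3f}")
```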