## 6.3.1 Sampling Methods using Python
The following Python code demonstrates how to apply different sampling methods for model evaluation. The examples compare the accuracy of the same model under four sampling methods:

- Hold-out sampling, using the `train_test_split()` function;
- k-Fold cross-validation, using the `KFold()` function;
- Leave-one-out sampling, using the `LeaveOneOut()` function;
- Bootstrapping, approximated here by repeating the hold-out split with `train_test_split()` under a different random seed each time, so that every iteration builds the model on a different sample.

The comments embedded in the code explain the rationale behind the programming logic.

In [1]:
import pandas as pd
import numpy as np
from numpy import mean
from sklearn.model_selection import train_test_split 
from sklearn.model_selection import KFold
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics 

#Loading Dataset
data = pd.read_csv('data/ChurnFinal.csv')

# specify inputs and label
df_inputs = pd.get_dummies(data[['Gender', 'Age', 'PostalCode', 'Cash', 'CreditCard', 
        'Cheque', 'SinceLastTrx', 'SqrtTotal', 'SqrtMax', 'SqrtMin']])
df_label = data['Churn']

# create model
model = DecisionTreeClassifier(criterion = 'entropy', splitter="best", max_depth=5, 
            min_samples_leaf=5, min_samples_split=0.1, random_state=1) 


# ------------ Model evaluation using Hold-out sampling method ------------#
# split the data into training (70%) and test (30%) sets, stratified by label
X_train, X_test, y_train, y_test = train_test_split(df_inputs, df_label, 
            stratify=df_label, test_size=0.3, random_state=1) 
# train the model on the training set
model.fit(X_train, y_train)
# apply the model to predict the test set
y_predict = model.predict(X_test)
# derive accuracy
acc = metrics.accuracy_score(y_test, y_predict)
# report performance
print('Hold-out method Sampling:  Accuracy =',round(acc,3))


# ------------ Model evaluation using k-Fold sampling method ------------#
# prepare the cross-validation procedure
cv = KFold(n_splits=10, random_state=1, shuffle=True)
scores = cross_val_score(model, df_inputs.values, df_label.values, scoring='accuracy', 
            cv=cv, n_jobs=-1)
# report performance
print('k-Fold Cross-Validation :  Accuracy = %.3f' % (mean(scores)))


# ------------ Model evaluation using Leave-one-out method ------------#
# prepare the leave-one-out procedure (one fold per sample, so it can be slow on large data sets)
cv = LeaveOneOut()
scores = cross_val_score(model, df_inputs.values, df_label.values, scoring='accuracy', 
            cv=cv, n_jobs=-1)
# report performance
print('Leave-One-Out Sampling  :  Accuracy = %.3f' % (mean(scores)))


# ------------ Model evaluation using Bootstrapping method ------------#
AccuracyValues=[]
n_times=10  # can be any number for sampling
 
# Performing bootstrapping (approximated by repeated random splits, without replacement)
for i in range(n_times):
    # split the data into training and test sets,
    # changing the seed value for each iteration
    X_train, X_test, y_train, y_test = train_test_split(df_inputs.values, df_label.values, 
            test_size=0.2, random_state=7+i)
    model = DecisionTreeClassifier(criterion = 'entropy', splitter="best", max_depth=5, 
            min_samples_leaf=5, min_samples_split=0.1, random_state=1) 
    #Creating the model on Training Data
    model.fit(X_train, y_train)
    # apply models for predictions
    y_predict = model.predict(X_test)

    # measure accuracy on the test data
    acc = metrics.accuracy_score(y_test, y_predict)
    # store the accuracy of this iteration
    AccuracyValues.append(np.round(acc, 5))

# Result of all bootstrapping trials as averaged accuracy
print('Bootstrapping Sampling  :  Accuracy =',round(np.mean(AccuracyValues),3))

Hold-out method Sampling:  Accuracy = 0.756
k-Fold Cross-Validation :  Accuracy = 0.741
Leave-One-Out Sampling  :  Accuracy = 0.758
Bootstrapping Sampling  :  Accuracy = 0.732
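
The bootstrapping loop above varies the random seed of `train_test_split()`, so each row appears at most once per iteration. A classical bootstrap instead draws the training set with replacement and evaluates on the rows that were never drawn (the out-of-bag rows). The following is a minimal sketch of that variant, not the method used in this section. It assumes a synthetic data set from `make_classification()` as a stand-in for `ChurnFinal.csv`, and the names `oob_scores`, `train_idx`, and `test_idx` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Assumption: synthetic data stands in for the churn data set
X, y = make_classification(n_samples=500, n_features=10, random_state=1)

n_times = 10
indices = np.arange(len(X))
oob_scores = []

for i in range(n_times):
    # draw training row indices WITH replacement, same size as the data set
    train_idx = resample(indices, replace=True, n_samples=len(indices),
                         random_state=7 + i)
    # rows that were never drawn form the out-of-bag test set
    test_idx = np.setdiff1d(indices, train_idx)

    model = DecisionTreeClassifier(criterion='entropy', max_depth=5,
                                   min_samples_leaf=5, min_samples_split=0.1,
                                   random_state=1)
    model.fit(X[train_idx], y[train_idx])
    y_predict = model.predict(X[test_idx])
    oob_scores.append(accuracy_score(y[test_idx], y_predict))

# average accuracy over all bootstrap iterations
print('Bootstrap (out-of-bag)  :  Accuracy = %.3f' % np.mean(oob_scores))
```

Because rows are drawn with replacement, roughly a third of the data is typically left out of each training sample and is available for testing.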


After running the above Python code, we observe that different sampling methods yield different accuracy estimates for the same model. The choice of sampling method depends on the characteristics of each method, the size of the available data set, and any business requirements that call for a particular sampling method.
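
One likely reason the estimates differ is that each one is computed from a limited number of random splits. As a rough illustration, the sketch below prints the per-fold accuracies and their standard deviation alongside the mean. It assumes synthetic data from `make_classification()` in place of the churn data, so the numbers it prints are not comparable to the results above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic data stands in for the churn data set
X, y = make_classification(n_samples=500, n_features=10, random_state=1)

model = DecisionTreeClassifier(criterion='entropy', max_depth=5,
                               min_samples_leaf=5, min_samples_split=0.1,
                               random_state=1)

# 10-fold cross-validation returns one accuracy value per fold
cv = KFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv)

# report the spread of the fold scores, not only their mean
print('Per-fold accuracy:', np.round(scores, 3))
print('Mean = %.3f, Std = %.3f' % (scores.mean(), scores.std()))
```

If the gap between two methods is small relative to this standard deviation, it may reflect sampling variation more than a real difference between the methods.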


**NOTE**: For a more detailed explanation of the functions that support these sampling methods, i.e., `train_test_split()`, `KFold()`, and `LeaveOneOut()`, students can refer to the official scikit-learn documentation, respectively:
- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html 
- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeaveOneOut.html
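
To see what these three functions produce, it can help to run them on a tiny array. The sketch below uses six toy samples (an assumption made purely for illustration) and prints which rows each method holds out for testing. The names `kfold_tests` and `loo_tests` are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, train_test_split

# six toy samples with values 0..5 (illustrative data only)
X = np.arange(6).reshape(-1, 1)

# Hold-out: a single random split, here with 2 test rows
X_train, X_test = train_test_split(X, test_size=2, random_state=1)
print('Hold-out test rows    :', X_test.ravel())

# k-Fold without shuffling: consecutive blocks take turns as the test set
kfold_tests = [test.tolist() for _, test in KFold(n_splits=3).split(X)]
print('3-fold test indices   :', kfold_tests)
# → [[0, 1], [2, 3], [4, 5]]

# Leave-one-out: every sample is the test set exactly once
loo_tests = [test.tolist() for _, test in LeaveOneOut().split(X)]
print('Leave-one-out indices :', loo_tests)
# → [[0], [1], [2], [3], [4], [5]]
```

`KFold()` and `LeaveOneOut()` yield index arrays through their `split()` method, whereas `train_test_split()` returns the split data directly.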