Use the wine dataset to perform a simple split validation with different training and test ratios:
- 50% training data  / 50% test data
- 60% training data  / 40% test data
- 70% training data  / 30% test data
- 80% training data  / 20% test data


1. Randomly split the wine data into training and test samples using the train_test_split function
2. Print the number of resulting training and test examples to check
3. Train an SVM classifier using the training sample
4. Apply the trained model to the test sample, calculating the accuracy (use the score method of the classifier)
5. Iterate over the different training/test splits printing the test accuracies

In [17]:
#Different splitting and validation approaches
#Simple split validation
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

#load wine data set
x, y = datasets.load_wine(return_X_y=True)
print("Shape of original Wine Data Set: ",x.shape, y.shape)

test_splits=[0.5,0.4,0.3,0.2]

for test_split in test_splits:
    #Randomly split data into training and test sample (data is shuffled by default)
    #Data is sampled without replacement
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_split, random_state=1)

    #Print the number of resulting training and test examples to check
    print("\nTraining examples: ",x_train.shape[0],", test examples: ",x_test.shape[0])

    #Train an SVM classifier using the training sample
    clf = svm.SVC(kernel='linear', C=1).fit(x_train, y_train)
    #Apply the trained model to the test sample, calculating the accuracy (use the score method of the classifier)
    print("Accuracy of classifier on ",test_split*100,"% test sample: ",clf.score(x_test, y_test))

Shape of original Wine Data Set:  (178, 13) (178,)

Training examples:  89 , test examples:  89
Accuracy of classifier on  50.0 % testing sample:  0.9550561797752809

Training examples:  106 , test examples:  72
Accuracy of classifier on  40.0 % testing sample:  0.9583333333333334

Training examples:  124 , test examples:  54
Accuracy of classifier on  30.0 % testing sample:  0.9629629629629629

Training examples:  142 , test examples:  36
Accuracy of classifier on  20.0 % testing sample:  0.9444444444444444


Now we add stratified sampling, which keeps the class proportions of the full data set in both the training and the test sample.

Encapsulate the split validation you just created into a function named "split_validation"
- parameters: features (the features), label (the label), test_split (the testing ratio), stratify_flag (the array of class labels to stratify on, or None for no stratification)
- no return value

Within the split_validation function, modify the parameters of the train_test_split function using the parameter "stratify" to enable stratified sampling.

Call the function from a for loop using different test_splits again like in the last task and let the function calculate and print test accuracies again.

In [26]:
#Variation Stratified Sampling: Enforces a representative sample concerning classes

def split_validation(features, label, test_split,stratify_flag):
    #Randomly split data into training and test sample (data is shuffled by default)
    #Data is sampled without replacement
    x_train, x_test, y_train, y_test = train_test_split(features, label, test_size=test_split, random_state=1, stratify=stratify_flag)

    #Print the number of resulting training and test examples to check
    print("\nTraining examples: ",x_train.shape[0],", test examples: ",x_test.shape[0])

    #Train an SVM classifier using the training sample
    clf = svm.SVC(kernel='linear', C=1).fit(x_train, y_train)
    #Apply the trained model to the test sample, calculating the accuracy (use the score method of the classifier)
    print("Accuracy of classifier on ",test_split*100,"% test sample: ",clf.score(x_test, y_test))

for test_split in test_splits:
    split_validation(x, y, test_split,y)


Training examples:  89 , test examples:  89
Accuracy of classifier on  50.0 % test sample:  0.9213483146067416

Training examples:  106 , test examples:  72
Accuracy of classifier on  40.0 % test sample:  0.9305555555555556

Training examples:  124 , test examples:  54
Accuracy of classifier on  30.0 % test sample:  0.9259259259259259

Training examples:  142 , test examples:  36
Accuracy of classifier on  20.0 % test sample:  0.9722222222222222
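
To see what stratification changes, it can help to compare the class counts in the test sample with and without the stratify argument. The following is a minimal sketch, assuming the same wine data and random_state=1 as above; the helper class_counts is a made-up name used only for this illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

x, y = datasets.load_wine(return_X_y=True)

def class_counts(labels):
    #Hypothetical helper: number of examples per class (classes are 0, 1, 2)
    return np.bincount(labels, minlength=3)

#Class counts in the full data set
full_counts = class_counts(y)
print("Full data set:   ", full_counts)

#Plain random split: class proportions in the test sample can drift
_, _, _, y_test_plain = train_test_split(x, y, test_size=0.2, random_state=1)
print("Random split:    ", class_counts(y_test_plain))

#Stratified split: class proportions follow the full data set
_, _, _, y_test_strat = train_test_split(x, y, test_size=0.2, random_state=1, stratify=y)
strat_counts = class_counts(y_test_strat)
print("Stratified split:", strat_counts)

#Fractional counts that would match the full proportions exactly
expected = full_counts / full_counts.sum() * len(y_test_strat)
print("Expected counts: ", expected)
```

With stratification, each class count in the test sample should differ from the exactly proportional value by less than one example.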


In practice, data is often split into three samples:
- training sample (just for training)
- validation sample (to check the accuracy of different models during the training phase)
- test sample (to check the accuracy of the final model)

Use train_test_split twice to split the wine dataset into 75% training data, 15% validation data and 10% test data!

Remark: You will not get the exact splits required - think about why!

In [38]:
#Variation Train-Validation-Test Sampling
train_split=0.75
test_split=0.1
val_split=1-train_split-test_split

#Split wine data set into training, test and validation examples
x_train_val, x_test, y_train_val, y_test = train_test_split(x, y, test_size=test_split, random_state=1)
x_train, x_val, y_train, y_val = train_test_split(x_train_val, y_train_val, test_size=val_split/(train_split+val_split), random_state=1)

#Print the overall number of examples and the actual percentages of the splits
total=x.shape[0]
print("Overall number of examples in wine data set",total)
print("Calculated percentage of training examples: ",x_train.shape[0]/total*100,"%")
print("Calculated percentage of validation examples: ",x_val.shape[0]/total*100,"%")
print("Calculated percentage of test examples: ",x_test.shape[0]/total*100,"%")

Overall number of examples in wine data set 178
Calculated percentage of training examples:  74.71910112359551 %
Calculated percentage of validation examples:  15.168539325842698 %
Calculated percentage of test examples:  10.112359550561797 %
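
One way to see why the percentages are not exactly 75/15/10 is to redo the arithmetic by hand: 178 examples cannot be divided into whole examples at those ratios. The sketch below assumes that train_test_split rounds a fractional test size up to the next whole example and gives the remainder to the training side; it uses only the standard library and does not call scikit-learn.

```python
import math

total = 178
train_split, test_split = 0.75, 0.1
val_split = 1 - train_split - test_split

#First split: a fractional test size is rounded up to a whole number of examples
n_test = math.ceil(test_split * total)
n_train_val = total - n_test

#Second split: the validation share is taken relative to the remaining examples
n_val = math.ceil(val_split / (train_split + val_split) * n_train_val)
n_train = n_train_val - n_val

for name, n in [("training", n_train), ("validation", n_val), ("test", n_test)]:
    print(name, n, "examples =", n / total * 100, "%")
```

The resulting counts (133, 27 and 18 examples) reproduce the percentages printed above.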


In cross validation, the training data is split multiple times into training and validation subsets.

Cross validation resamples without replacement and is primarily used to validate the hyperparameters of the model (not as an ensemble method).


1. Define an SVM classifier using the svm.SVC class

2. Perform cross validation with k=5 using cross_val_score

3. Print the calculated accuracies / scores

4. Print the mean value of the calculated accuracies and their standard deviation


In [77]:
#K-Fold Cross Validation
from sklearn.model_selection import cross_val_score
from sklearn import svm

#Define an SVM classifier using the svm.SVC class
clf = svm.SVC(kernel='linear', C=1)

#Perform cross validation with k=5 using cross_val_score
scores = cross_val_score(clf, x, y, cv=5)
#Print the calculated accuracies / scores
print("Accuracies: ",scores)
#Print the mean value of the calculated accuracies and twice their standard deviation (the +/- range)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracies:  [0.96666667 1.         0.96666667 0.96666667 1.        ]
Accuracy: 0.98 (+/- 0.03)
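
To make the repeated splitting concrete, the following sketch lists the training and validation indices of each fold for ten toy examples. It uses KFold as a simplification: for a classifier and an integer cv, cross_val_score applies a stratified variant (StratifiedKFold), so the real folds also balance the classes. The names examples and val_seen are chosen for this illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold

#Ten toy examples, identified by their index
examples = np.arange(10)

kf = KFold(n_splits=5)
val_seen = []
for fold, (train_idx, val_idx) in enumerate(kf.split(examples), start=1):
    print("Fold", fold, "train:", train_idx, "validation:", val_idx)
    val_seen.extend(val_idx.tolist())
```

Every example lands in exactly one validation fold, which is what "resampling without replacement" means here.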


If we want to combine models that are trained on different subsets of the training examples, we can use a variation of the Bagging (Bootstrap AGGregatING) approach.

1. Load the iris data set (150 examples, which divide more evenly into samples)

2. Split the iris data set into training and test examples using train_test_split as before, with a test split of 0.2

3. Define a new classifier using the BaggingClassifier class with 6 estimators (n_estimators=6) and bootstrap=False

4. Fit your bagging classifier

5. Apply the trained bagging model on test data

6. Combine features, label and predicted label to a new 2-dim array and print the result


In [78]:
from sklearn.ensemble import BaggingClassifier

#Load the iris data set (150 examples, which divide more evenly into samples)
x, y = datasets.load_iris(return_X_y=True)
print("Shape of original iris Data Set: ",x.shape, y.shape)

#Split iris data set into training and test examples using train_test_split as before test split=0.2
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

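#Define a new classifier using the BaggingClassifier class with 6 estimators and bootstrap=False
#bootstrap=False draws each estimator's training sample without replacement
#With the default max_samples=1.0 every estimator sees all training examples;
#the fitting only resembles K-Fold cross validation if max_samples is set below 1.0
#The estimator is passed positionally because the keyword base_estimator was renamed to estimator in scikit-learn 1.2
clf = BaggingClassifier(svm.SVC(), n_estimators=6, random_state=0, bootstrap=False)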
#Fit your bagging classifier
clf.fit(x_train, y_train)

#Apply the trained bagging model on test data 
y_test_pred=clf.predict(x_test)
#Combine features, label and predicted label to a new 2-dim array and print the result
test_comp=np.column_stack((x_test, y_test, y_test_pred))
print(test_comp)

Shape of original iris Data Set:  (150, 4) (150,)
[[5.8 4.  1.2 0.2 0.  0. ]
 [5.1 2.5 3.  1.1 1.  1. ]
 [6.6 3.  4.4 1.4 1.  1. ]
 [5.4 3.9 1.3 0.4 0.  0. ]
 [7.9 3.8 6.4 2.  2.  2. ]
 [6.3 3.3 4.7 1.6 1.  1. ]
 [6.9 3.1 5.1 2.3 2.  2. ]
 [5.1 3.8 1.9 0.4 0.  0. ]
 [4.7 3.2 1.6 0.2 0.  0. ]
 [6.9 3.2 5.7 2.3 2.  2. ]
 [5.6 2.7 4.2 1.3 1.  1. ]
 [5.4 3.9 1.7 0.4 0.  0. ]
 [7.1 3.  5.9 2.1 2.  2. ]
 [6.4 3.2 4.5 1.5 1.  1. ]
 [6.  2.9 4.5 1.5 1.  1. ]
 [4.4 3.2 1.3 0.2 0.  0. ]
 [5.8 2.6 4.  1.2 1.  1. ]
 [5.6 3.  4.5 1.5 1.  1. ]
 [5.4 3.4 1.5 0.4 0.  0. ]
 [5.  3.2 1.2 0.2 0.  0. ]
 [5.5 2.6 4.4 1.2 1.  1. ]
 [5.4 3.  4.5 1.5 1.  1. ]
 [6.7 3.  5.  1.7 1.  2. ]
 [5.  3.5 1.3 0.3 0.  0. ]
 [7.2 3.2 6.  1.8 2.  2. ]
 [5.7 2.8 4.1 1.3 1.  1. ]
 [5.5 4.2 1.4 0.2 0.  0. ]
 [5.1 3.8 1.5 0.3 0.  0. ]
 [6.1 2.8 4.7 1.2 1.  1. ]
 [6.3 2.5 5.  1.9 2.  2. ]]
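
To check which examples each estimator is actually trained on, the fitted ensemble can be inspected. The sketch below is a variation under an assumption not made in the exercise: max_samples=100 limits every estimator to 100 of the 120 training examples, so that the estimators really see different subsets. The estimators_samples_ attribute lists the drawn training indices per estimator.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

x, y = datasets.load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

#Assumption for this sketch: 100 of the 120 training examples per estimator
clf = BaggingClassifier(svm.SVC(), n_estimators=6, max_samples=100,
                        bootstrap=False, random_state=0)
clf.fit(x_train, y_train)

#estimators_samples_ holds the training indices drawn for each estimator
sample_sizes = [len(np.unique(idx)) for idx in clf.estimators_samples_]
print("Distinct training examples per estimator:", sample_sizes)
print("Test accuracy of the ensemble:", clf.score(x_test, y_test))
```

Because the draws are made without replacement, each estimator should report 100 distinct examples.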


In contrast to resampling without replacement (which we just performed), bootstrapping resamples with replacement.

This means that an example can appear more than once in the same bootstrap sample, while other examples are not drawn at all.

Let's demonstrate this with a simple data set

1. Perform bootstrapping with 10 samples using a loop

2. To create a single sample, use the resample function drawing 4 examples with replacement

3. Print the examples within your sample
   
4. Print the out of bag observations - any example, which is in the original data, but not in the sample

In [79]:
#Bootstrapping
from sklearn.utils import resample

# data sample
data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
print('Original data: ',data)

print("Bootstrapping with k=10. \nNumber of instances in OOB sample depends on number of repeatedly drawn instances in bootstrap sample!")

#Perform bootstrapping with 10 samples using a loop
for i in range(1,11):
    #To create a single sample, use the resample function drawing 4 examples with replacement
    boot = resample(data, replace=True, n_samples=4)
    #Print the examples within your sample
    print('\nk= %i Bootstrap Sample: %s' % (i,boot))

    #Print the out of bag observations - any example, which is in the original data, but not in the sample
    oob = [x for x in data if x not in boot]
    print('OOB Sample: %s' % oob)
    

Original data:  [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
Bootstrapping with k=10. 
Number of instances in OOB sample depends on number of repeatedly drawn instances in bootstrap sample!

k= 1 Bootstrap Sample: [0.6, 0.4, 0.6, 0.3]
OOB Sample: [0.1, 0.2, 0.5]

k= 2 Bootstrap Sample: [0.4, 0.2, 0.5, 0.3]
OOB Sample: [0.1, 0.6]

k= 3 Bootstrap Sample: [0.4, 0.3, 0.4, 0.1]
OOB Sample: [0.2, 0.5, 0.6]

k= 4 Bootstrap Sample: [0.3, 0.6, 0.4, 0.6]
OOB Sample: [0.1, 0.2, 0.5]

k= 5 Bootstrap Sample: [0.1, 0.6, 0.5, 0.1]
OOB Sample: [0.2, 0.3, 0.4]

k= 6 Bootstrap Sample: [0.1, 0.4, 0.3, 0.5]
OOB Sample: [0.2, 0.6]

k= 7 Bootstrap Sample: [0.5, 0.2, 0.4, 0.6]
OOB Sample: [0.1, 0.3]

k= 8 Bootstrap Sample: [0.5, 0.4, 0.1, 0.4]
OOB Sample: [0.2, 0.3, 0.6]

k= 9 Bootstrap Sample: [0.1, 0.6, 0.4, 0.6]
OOB Sample: [0.2, 0.3, 0.5]

k= 10 Bootstrap Sample: [0.3, 0.6, 0.6, 0.1]
OOB Sample: [0.2, 0.4, 0.5]
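
The size of each OOB sample follows directly from how many distinct examples were drawn: with 6 examples and 4 draws, a sample with one repeated value leaves 3 examples out of bag, and a sample without repeats leaves 2. The sketch below checks this relation and computes the chance that a given example is never drawn. It swaps scikit-learn's resample for random.choices from the standard library and uses a fixed seed, both of which are choices made for this illustration.

```python
import random

random.seed(0)  #fixed seed so the sketch is repeatable
data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

results = []
for i in range(1, 11):
    #Draw 4 examples with replacement
    boot = random.choices(data, k=4)
    #Out of bag: every original example that was not drawn
    oob = [v for v in data if v not in boot]
    results.append((boot, oob))
    print(i, boot, "distinct:", len(set(boot)), "OOB size:", len(oob))

#Probability that a given example is never drawn in 4 draws from 6
p_oob = (1 - 1 / len(data)) ** 4
print("P(example is out of bag):", p_oob)
```

On average, a little under half of the six examples (about 48%) end up out of bag for each sample.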
