# Splitting a Dataset
- A dataset needs to be split into 3 sets to develop a model:
    - **Training Set:** To build the model
    - **Validation Set:** To select the parameters of the model
    - **Test Set:** To evaluate the performance of the selected parameters
- While cross-validation lets some functions skip the explicit 3-way split, always remember to keep a set of unseen data (X_test, y_test) for testing your model.

In sklearn, there are 2 ways to split your data, i.e.:
- Method 1: train_test_split
- Method 2: Cross-validation (CV)


## Libraries

In [3]:
import os
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression

# for splitting a dataset into training and testing sets
from sklearn.model_selection import train_test_split

#for cross-validation
from sklearn.model_selection import cross_val_score

# similar to cross_val_score, but also reports fit and score times
from sklearn.model_selection import cross_validate

#for imbalanced target to be reflected in each fold
from sklearn.model_selection import StratifiedKFold

# randomise KFold in cross-validation
from sklearn.model_selection import KFold

#Leave-one-out cross-validation
from sklearn.model_selection import LeaveOneOut

# Shuffle-split cross-validation
from sklearn.model_selection import ShuffleSplit

## Reading in the data and selecting features and target

In [4]:
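# build the path to data.csv in the current working directory
pwd = os.getcwd()
data = os.path.join(pwd, "data.csv")
df = pd.read_csv(data)

# select the feature columns and the target column
features = df[["Pclass", "Sex", "Fare"]]
target = df[["Survived"]]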

# Method 1: train_test_split
- a simple splitting function that divides a dataset into 2 by proportion
- by default, a dataset is divided into 0.75 (X_train, y_train) and 0.25 (X_test, y_test)
- use the argument test_size (a float for a proportion, or an int for a number of rows) to change the split
- use the argument stratify=target to keep the class proportions of the target the same in both sets, which matters especially when one class is much smaller than the other
- remember that the test set (i.e. X_test, y_test) should always be reserved as an unseen dataset for the final evaluation of the model.
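
As a rough illustration of the test_size and stratify arguments described above, the following sketch uses a small synthetic, imbalanced dataset (the arrays `X` and `y` are made up for this example and are not the notebook's data). It checks that the share of the minority class stays the same on both sides of the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (an assumption for illustration): 90 zeros, 10 ones
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# test_size=0.2 overrides the default of 0.25;
# stratify=y keeps the 9:1 class ratio in both the training and the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# share of the minority class (label 1) in each set
train_pos_rate = y_train.mean()
test_pos_rate = y_test.mean()

print(len(X_train), len(X_test))
print(train_pos_rate, test_pos_rate)
```

Without stratify, a small minority class could by chance end up over- or under-represented in the test set, which makes the final evaluation less trustworthy.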

In [5]:
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=42, stratify=target)

# merge the training features and the training target on their shared index
X_train_y_train = pd.merge(X_train, y_train, left_index=True, right_index=True)
X_train_y_train.head()

Unnamed: 0,Pclass,Sex,Fare,Survived
486,1,female,90.0,1
238,2,male,10.5,0
722,2,male,13.0,0
184,3,female,22.025,1
56,2,female,10.5,1


- To explicitly divide a dataset into 3, we first divide it into two: trainval (X_trainval, y_trainval) and test (X_test, y_test).

- After that, the trainval datasets are split again into train (X_train, y_train) and valid (X_valid, y_valid) datasets.

Hence, the following code:

In [6]:
# split the whole dataset into trainval set and test set
X_trainval, X_test, y_trainval, y_test = train_test_split(features, target, random_state=0, stratify=target)

# within trainval set, split it into train set and valid set
X_train, X_valid, y_train, y_valid = train_test_split(X_trainval, y_trainval, random_state=0, stratify=y_trainval)
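
It can help to check what share of the data each of the three sets ends up with after two default splits. The sketch below repeats the same two calls on hypothetical stand-in data (1000 rows, balanced binary target); roughly, the result is 0.75 × 0.75 ≈ 56% train, 0.75 × 0.25 ≈ 19% valid and 25% test.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1000 rows with a balanced binary target
X = np.arange(1000).reshape(-1, 1)
y = np.tile([0, 1], 500)

# first split: trainval (75%) vs test (25%)
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# second split: train (75% of trainval) vs valid (25% of trainval)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_trainval, y_trainval, random_state=0, stratify=y_trainval
)

n = len(X)
shares = {
    "train": len(X_train) / n,
    "valid": len(X_valid) / n,
    "test": len(X_test) / n,
}
print(shares)
```

If a different ratio is wanted (for example 60/20/20), test_size can be set explicitly in each call.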

# Method 2: Cross-Validation (CV)

- regarding CV, read here: https://towardsdatascience.com/why-and-how-to-cross-validate-a-model-d6424b45261f
- a model is required, hence non-numeric values need preprocessing first
- the splitting of the dataset may or may not be randomised, depending on the splitter (e.g. use KFold with shuffle=True to randomise)

### Some of the benefits:
- X and y are tested thoroughly across k folds, which gives a more reliable estimate of how well a model generalises.
- Utilisation of a dataset: train\_test\_split divides a dataset once (75% and 25% by default) for training and evaluation, whereas cross-validation divides a dataset into several subsets and rotates them between training and testing (k-fold, as in cv=5), so every row is used for testing exactly once.
- Cross-validation produces a range of scores, which indicates how a model performs on new data in its best-case and worst-case scenarios.

### Drawback:
- Computational cost: k models need to be trained instead of a single model.
	- Personal opinion: given the speed of modern CPUs and a relatively manageable data size (thousands of rows, a few dozen columns at most), this drawback should be manageable.

### Caution:
- Cross-validation is not a way to build a model that can be applied to new data.
- It does not return a model.
- Use it to evaluate how well a given algorithm will generalise when trained on a specific dataset.
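
The caution above can be seen directly: cross_val_score works on internal copies of the estimator, so the object passed in is still unfitted afterwards. The sketch below uses synthetic data from `make_classification` (an assumption for illustration, not the notebook's dataset) and checks for the `coef_` attribute, which LogisticRegression only has after fitting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic data, used here only for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression()

# cross-validation returns one score per fold, not a model
scores = cross_val_score(model, X, y, cv=5)

# the estimator passed in was cloned internally, so it is still unfitted
fitted_after_cv = hasattr(model, "coef_")

# to obtain a usable model, fit it explicitly on the training data
model.fit(X, y)
fitted_after_fit = hasattr(model, "coef_")

print(scores.shape, fitted_after_cv, fitted_after_fit)
```

In practice this means a typical workflow uses cross-validation to choose an algorithm or its parameters, then fits the chosen model on the training data before evaluating it on the test set.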

In [7]:
# instantiate a model
logreg = LogisticRegression()

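# transformer: one-hot encode the categorical columns and scale the numeric one
# note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; `sparse` was removed in 1.4
ct = ColumnTransformer([
    ("onehot", OneHotEncoder(sparse=False), ["Pclass", "Sex"]),
    ("scaling", StandardScaler(), ["Fare"])
    ])

# fit the transformer on X_trainval and transform it
X_trainval_fit_trans = ct.fit_transform(X_trainval)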
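
One caveat with fitting the transformer on all of X_trainval before cross-validating is that each validation fold has already influenced the scaling statistics. A possible alternative, sketched here on a made-up frame with the same column names, wraps the transformer and the model in a pipeline so that preprocessing is re-fitted inside each fold. `make_pipeline` and `handle_unknown="ignore"` are choices made for this sketch and are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# made-up data with the notebook's column names (values are arbitrary)
X = pd.DataFrame({
    "Pclass": np.tile([1, 2, 3], 20),
    "Sex": np.tile(["female", "male", "male", "female"], 15),
    "Fare": np.linspace(5, 100, 60),
})
y = np.tile([0, 1], 30)

ct = ColumnTransformer([
    # handle_unknown="ignore" avoids errors if a category is missing from a training fold
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["Pclass", "Sex"]),
    ("scaling", StandardScaler(), ["Fare"]),
])

# the pipeline fits the transformer on the training part of each fold only
pipe = make_pipeline(ct, LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)
print(scores)
```

For a small dataset with a simple scaler the difference in scores is often minor, but the pipeline form keeps the evaluation free of information from the held-out fold.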

There are quite a lot of cross-validation functions; a few are listed below:

1. Standard KFold (cross_val_score and cross_validate)
2. Stratified k-Fold Cross-Validation - especially useful when one or more target classes are much smaller than the others
3. Cross-Validation with KFold (pass a KFold object to the cv argument for more control over the folds, e.g. shuffling the dataset before splitting)
4. Leave-one-out Cross-Validation (1 sample is held out for testing and the rest of the dataset is used for training in each fold)
5. Shuffle-split Cross-Validation (randomly sample a portion for training and a portion for testing on each split)

Normally, items 1 and 2 are good enough.

(cross-validation with groups is not listed here)
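
To get a feel for how the splitters in the list differ, one simple check is how many train/test splits each of them generates. The sketch below uses a tiny placeholder dataset of 20 rows (an assumption for illustration) and the `get_n_splits` method that every scikit-learn splitter provides.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, ShuffleSplit, StratifiedKFold

# placeholder data: 20 rows, imbalanced binary target
X = np.zeros((20, 2))
y = np.array([0] * 15 + [1] * 5)

splitters = {
    "KFold": KFold(n_splits=5),
    "StratifiedKFold": StratifiedKFold(n_splits=5),
    "LeaveOneOut": LeaveOneOut(),
    "ShuffleSplit": ShuffleSplit(n_splits=10, test_size=0.25, random_state=0),
}

# number of models that cross-validation would train with each splitter
n_splits = {name: cv.get_n_splits(X, y) for name, cv in splitters.items()}
print(n_splits)
```

The count for LeaveOneOut equals the number of rows, which is why it becomes expensive on larger datasets.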

## Standard KFold - cross_val_score

In [8]:
score = cross_val_score(logreg, X_trainval_fit_trans, np.ravel(y_trainval), cv=5)
score

array([0.7761194 , 0.80597015, 0.82835821, 0.78195489, 0.7443609 ])

In [9]:
score.mean()

0.7873527101335428

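## Standard KFold - cross_validate
- similar to cross_val_score, but also reports the fit time and score time of each fold
- train_score is the score on the training portion of each fold; it is only returned when return_train_score=True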

In [10]:
res = cross_validate(logreg, X_trainval_fit_trans, np.ravel(y_trainval), cv=5, return_train_score=True)
res_df = pd.DataFrame(res)
display(res_df)

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.004553,0.000304,0.776119,0.790262
1,0.004591,0.000287,0.80597,0.782772
2,0.005341,0.00032,0.828358,0.777154
3,0.004668,0.00025,0.781955,0.788785
4,0.003966,0.000246,0.744361,0.798131


In [11]:
res["test_score"].mean()

0.7873527101335428
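
As a small check of what cross_validate returns, the sketch below compares the keys of the result with and without return_train_score=True. It runs on synthetic data from `make_classification`, which is an assumption for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# synthetic data, used here only for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# default call: timing information and test scores only
res_default = cross_validate(LogisticRegression(), X, y, cv=5)

# with return_train_score=True, the score on each training portion is added
res_train = cross_validate(
    LogisticRegression(), X, y, cv=5, return_train_score=True
)

keys_default = sorted(res_default)
keys_train = sorted(res_train)
print(keys_default)
print(keys_train)
```

A train_score that is much higher than the matching test_score is a common sign of overfitting, which is the main reason to request it.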

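## Stratified k-Fold Cross-Validation
- an imbalance in the target, i.e. one class being much larger/smaller than the other, should be reflected in each fold
- for a classifier, cross_val_score with an integer cv (e.g. cv=5) already uses StratifiedKFold without shuffling, which is why the mean score here is identical to the one from cv=5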

In [12]:
kf = StratifiedKFold(n_splits=5, random_state=None, shuffle=False)

score_kf = cross_val_score(logreg, X_trainval_fit_trans, np.ravel(y_trainval), cv=kf)
score_kf.mean()

0.7873527101335428
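
The effect of stratification is easiest to see on data that is sorted by class. The following sketch uses a made-up target of 80 zeros followed by 20 ones and compares the share of ones in each test fold for plain KFold and StratifiedKFold; the helper `positive_rate_per_fold` is a hypothetical name introduced for this example.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# made-up target sorted by class: 80 zeros followed by 20 ones
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((100, 1))


def positive_rate_per_fold(cv):
    """Return the share of label 1 in the test part of every fold."""
    return [float(y[test_idx].mean()) for _, test_idx in cv.split(X, y)]


# plain KFold without shuffling cuts the data into consecutive blocks
plain = positive_rate_per_fold(KFold(n_splits=5))

# StratifiedKFold keeps the overall 20% share of ones in every fold
stratified = positive_rate_per_fold(StratifiedKFold(n_splits=5))

print(plain)
print(stratified)
```

With plain KFold on sorted data, four of the five test folds contain no positive samples at all, so the fold scores would say little about the minority class.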

## Cross Validation with KFold

In [26]:
# shuffle the data before splitting (shuffle=True); random_state=42 makes the shuffle reproducible
kfold = KFold(n_splits=10, shuffle=True, random_state=42)

score_kfold = cross_val_score(logreg, X_trainval_fit_trans, np.ravel(y_trainval), cv=kfold)
score_kfold.mean()

0.7872908186341022
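
To illustrate what shuffle and random_state do in KFold, the sketch below lists the test indices of each fold for a toy array of 10 rows (an assumption for illustration). Without shuffling the folds are consecutive blocks; with shuffling and a fixed random_state the rows are mixed, but the same folds come back on every run.

```python
import numpy as np
from sklearn.model_selection import KFold

# toy data: 10 rows, so the fold indices are easy to read
X = np.arange(10).reshape(-1, 1)

# without shuffling, test folds are consecutive blocks of rows
unshuffled = [test.tolist() for _, test in KFold(n_splits=5).split(X)]

# with shuffling, rows are assigned to folds at random;
# a fixed random_state makes the assignment repeatable
shuffled_a = [
    test.tolist()
    for _, test in KFold(n_splits=5, shuffle=True, random_state=42).split(X)
]
shuffled_b = [
    test.tolist()
    for _, test in KFold(n_splits=5, shuffle=True, random_state=42).split(X)
]

print(unshuffled)
print(shuffled_a)
```

Either way, every row appears in exactly one test fold; shuffling only changes which rows are grouped together.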

## Leave-one-out Cross-Validation

In [14]:
loo = LeaveOneOut()
score_loo = cross_val_score(logreg, X_trainval_fit_trans, np.ravel(y_trainval), cv=loo)

len(score_loo)

668

In [15]:
score_loo.mean()

0.7874251497005988
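
Because each leave-one-out fold tests on a single row, every individual score is either 0 or 1, and the mean is simply the share of rows predicted correctly. A minimal sketch on synthetic data (generated with `make_classification`, an assumption for illustration) shows this:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# small synthetic dataset, since leave-one-out trains one model per row
X, y = make_classification(n_samples=40, n_features=5, random_state=0)

scores = cross_val_score(LogisticRegression(), X, y, cv=LeaveOneOut())

# accuracy on a single held-out row can only be 0.0 (wrong) or 1.0 (right)
unique_scores = set(np.unique(scores).tolist())

print(len(scores), unique_scores, scores.mean())
```

This also explains the length of score_loo above: one score per row of the training data.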

## Shuffle-split Cross Validation

In [16]:
shuffle_split = ShuffleSplit(test_size=.5, train_size=.2, n_splits=10)
scores = cross_val_score(logreg, X_trainval_fit_trans, np.ravel(y_trainval), cv=shuffle_split)
scores

array([0.80239521, 0.79041916, 0.79640719, 0.77245509, 0.76347305,
       0.78742515, 0.76946108, 0.75149701, 0.79341317, 0.78443114])

In [17]:
scores.mean()

0.7811377245508981
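
With test_size=.5 and train_size=.2, each split trains on 20% of the rows, tests on 50%, and leaves the remaining 30% unused. The sketch below checks the resulting sizes on a placeholder array of 100 rows; the data and the random_state are assumptions added for reproducibility.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# placeholder data: 100 rows
X = np.zeros((100, 1))

cv = ShuffleSplit(n_splits=10, test_size=0.5, train_size=0.2, random_state=0)

sizes = []
overlaps = []
for train_idx, test_idx in cv.split(X):
    # record how many rows go to training and testing in this split
    sizes.append((len(train_idx), len(test_idx)))
    # training and test rows never overlap within a split
    overlaps.append(len(set(train_idx) & set(test_idx)))

print(sizes)
```

Unlike k-fold, the splits are drawn independently, so a row may appear in the test part of several splits or of none.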