# Splitting a Dataset

In [130]:
import os
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

#for splitting a dataset into training and testing sets
from sklearn.model_selection import train_test_split

#for cross-validation
from sklearn.model_selection import cross_val_score

#KFold, for randomising the folds in cross-validation
from sklearn.model_selection import KFold

#machine learning model for Cross-Validation
from sklearn.linear_model import LogisticRegression

In [63]:
pwd = os.getcwd()
data = os.path.join(pwd, "data.csv")
df = pd.read_csv(data)
df.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


### Feature and target selection

In [95]:
features = df[["Pclass", "Sex","Fare"]]
target = df[["Survived"]]

target_np = target.to_numpy()

# Splitting Dataset

- There are two ways to split a dataset:
    - a simple split of the dataset into two parts (see Method 1)
    - splitting the dataset into k folds (see Method 2)
- Both methods are used throughout machine learning; neither is better than the other, and the choice depends on the situation


## Method 1: train_test_split
- a simple splitting method that divides a dataset into two parts by proportion
- by default, the dataset is divided into 0.75 (X_train, y_train) and 0.25 (X_test, y_test)
- use the argument test_size= *float or int* to change the proportion (a float is a fraction of the rows, an int is an absolute number of rows)

In [129]:
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=42)

#merge the training features and the training target back together on their shared index
X_train_y_train = pd.merge(X_train, y_train, left_index=True, right_index=True)
X_train_y_train.head()

Unnamed: 0,Pclass,Sex,Fare,Survived
298,1,male,30.5,1
884,3,male,7.05,0
247,2,female,14.5,1
478,3,male,7.5208,0
305,1,male,151.55,1
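
The effect of the test_size argument described in Method 1 can be checked with a small sketch. The data here is a hypothetical 100-row stand-in for the Titanic features and target (the names `demo_features` and `demo_target` are not part of the notebook), chosen only so the split sizes are easy to read.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the Titanic features and target (100 rows).
rng = np.random.default_rng(0)
demo_features = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=100),
    "Fare": rng.uniform(5, 100, size=100),
})
demo_target = pd.Series(rng.integers(0, 2, size=100), name="Survived")

# Default split: 75% of the rows for training, 25% for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    demo_features, demo_target, random_state=42
)
default_sizes = (len(X_tr), len(X_te))

# test_size as a float: the fraction of rows held out for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    demo_features, demo_target, test_size=0.2, random_state=42
)
float_sizes = (len(X_tr), len(X_te))

# test_size as an int: the absolute number of rows held out for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    demo_features, demo_target, test_size=10, random_state=42
)
int_sizes = (len(X_tr), len(X_te))

print(default_sizes, float_sizes, int_sizes)
```

Keeping random_state fixed, as the notebook does, makes the same rows land in the same split on every run.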


## Method 2: Cross-Validation (CV)
- regarding cv, read here: https://towardsdatascience.com/why-and-how-to-cross-validate-a-model-d6424b45261f
- a model is required, hence preprocessing is needed for non-numeric values
- with an integer cv, the dataset is split into folds without shuffling; pass a KFold with shuffle=True to randomise the split

### Some of the benefits:
- X and y are tested across k folds, so the estimate of how well the model generalises is more reliable than one taken from a single split.
- Utilisation of a dataset: train\_test\_split divides a dataset once (75% and 25% by default) for training and evaluation, whereas cross-validation divides a dataset into several subsets for training and testing (k folds, as in cv=5), so every row is used for testing exactly once.
- Cross-validation produces a range of scores that indicates how the model performs on new data in its best and worst cases.

### Drawback:
- Computational cost: k models need to be trained instead of a single model.
	- Personal opinion: given the speed and efficiency of modern CPUs and a relatively manageable data size (thousands of rows, dozens of columns at most), this drawback should be manageable.

### Caution:
- Cross-validation is not a way to build a model that can be applied to new data.
	- It does not return a model.
	- Use it to evaluate how well a given algorithm will generalise when trained on a specific dataset.
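
The caution above can be illustrated with a minimal sketch, assuming hypothetical numeric data (`X_demo`, `y_demo`) in place of the preprocessed Titanic features. cross_val_score works on clones of the estimator, so the estimator passed in stays unfitted and has to be fitted separately before it can predict.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical numeric data standing in for the preprocessed features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression()

# cross_val_score fits one clone of the model per fold and returns only the scores.
scores = cross_val_score(model, X_demo, y_demo, cv=5)

# The estimator passed in is left unfitted, so it has no coefficients yet.
fitted_after_cv = hasattr(model, "coef_")

# To predict on new data, fit the model explicitly.
model.fit(X_demo, y_demo)
predictions = model.predict(X_demo[:5])

print(scores, fitted_after_cv, predictions)
```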

In [131]:
# instantiate a model

logreg = LogisticRegression()

In [118]:
ct = ColumnTransformer([
    # sparse_output replaced the older sparse argument in scikit-learn 1.2
    ("onehot", OneHotEncoder(sparse_output=False), ["Pclass", "Sex"]),
    ("scaling", StandardScaler(), ["Fare"])
    ])

X_fit_trans = ct.fit_transform(features)

In [112]:
score = cross_val_score(logreg, X_fit_trans, np.ravel(target), cv=5)
score

array([0.79329609, 0.80337079, 0.76966292, 0.75842697, 0.78651685])

In [110]:
score.mean()

0.7822547234950725
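
In the cells above, the ColumnTransformer is fitted on all rows before cross-validation, so the scaler and encoder also see the rows that later serve as test folds. One possible alternative, sketched here on hypothetical data (`demo` and `demo_target` are made up for illustration), is to wrap the preprocessing and the model in a pipeline so that the preprocessing is re-fitted on the training part of each fold.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data with the same three feature columns as the notebook.
rng = np.random.default_rng(0)
n_rows = 200
demo = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n_rows),
    "Sex": rng.choice(["male", "female"], size=n_rows),
    "Fare": rng.uniform(5, 100, size=n_rows),
})

# Made-up target: mostly follows Sex, with about 20% of the labels flipped.
is_female = (demo["Sex"] == "female").to_numpy()
flip = rng.random(n_rows) < 0.2
demo_target = np.logical_xor(is_female, flip).astype(int)

preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["Pclass", "Sex"]),
    ("scaling", StandardScaler(), ["Fare"]),
])

# The pipeline is cloned and fitted per fold, preprocessing included.
pipe = make_pipeline(preprocess, LogisticRegression())
pipe_scores = cross_val_score(pipe, demo, demo_target, cv=5)

print(pipe_scores, pipe_scores.mean())
```

With only one-hot encoding and scaling the difference in scores is usually small, but the pipeline keeps the evaluation honest when heavier preprocessing is added later.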

### Randomise CV with KFold

In [114]:
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

In [116]:
score_kfold = cross_val_score(logreg, X_fit_trans, np.ravel(target), cv=kfold)
score_kfold

array([0.78212291, 0.76404494, 0.83707865, 0.74719101, 0.80337079])

In [117]:
score_kfold.mean()

0.786761659657272
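
To see what the KFold object hands to cross_val_score, the folds can be inspected directly. This is a small sketch on a hypothetical 10-row array (`X_demo`), using the same KFold settings as above; it counts how often each row ends up in a test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical 10-row dataset, small enough to inspect by eye.
X_demo = np.arange(20).reshape(10, 2)

kfold_demo = KFold(n_splits=5, shuffle=True, random_state=42)

test_counts = np.zeros(len(X_demo), dtype=int)
fold_sizes = []
for train_idx, test_idx in kfold_demo.split(X_demo):
    # Each iteration yields the row positions for one train/test fold.
    fold_sizes.append((len(train_idx), len(test_idx)))
    test_counts[test_idx] += 1

print(fold_sizes)
print(test_counts)
```

Every row appears in exactly one test fold, which is why cross-validation uses the whole dataset for both training and testing.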