In [10]:
from sklearn.model_selection import train_test_split, LeaveOneOut, KFold, StratifiedKFold, GroupKFold
import numpy as np

Why do we split data into separate training and test sets?

* See [examples](https://arxiv.org/pdf/2109.06827.pdf)
* I.I.D. (VERY IMPORTANT!!!)

## Data Splitting
Train/test split with `train_test_split`:
* By default, `train_test_split` puts 75% of the data in the training set and 25% in the test set, which is a common rule of thumb.
* `stratify=y` makes sure the labels are distributed in the train and test sets as they are in the original dataset.
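
To make the effect of `stratify=y` concrete, here is a minimal sketch on a hypothetical imbalanced toy dataset (the names `X_demo` and `y_demo` and the 8:4 class balance are assumptions for illustration, not part of the original notebook). It counts the labels on each side of the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 8 samples of class 0 and 4 of class 1.
X_demo = np.arange(24).reshape(12, 2)
y_demo = np.array([0] * 8 + [1] * 4)

# Hold out 25% of the samples and ask for the class ratio to be preserved.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Per-class counts on each side of the split.
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print("train:", train_counts, "test:", test_counts)
```

Both parts keep the original 2:1 class ratio. Without `stratify`, a small test set like this one could easily end up with no samples of the minority class.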

In [13]:
X = ["The movie was a dull and uninteresting depiction of a fascinating historical event.",
"A beautifully crafted masterpiece that captures the essence of childhood adventure.",
"I was excited to see this film because I love historical dramas, but it was a huge disappointment. The storyline was disjointed, and it seemed like the director was trying too hard to be artsy. The acting was mediocre at best, and I found myself checking my watch multiple times throughout. Overall, a very underwhelming experience.",
"This movie is a true gem. The storyline was gripping from start to finish, filled with unexpected twists and turns. The performances were top-notch, with the lead actors delivering some of their career-best performances. The cinematography was beautiful, and the soundtrack perfectly complemented the mood of the film. It's a must-watch for anyone who appreciates quality cinema."]

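# y was undefined in this cell; these labels are inferred from the output below.
y = np.array([1, 2, 3, 4])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
[X_train, X_test, y_train, y_test]  # display the four parts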


[['The movie was a dull and uninteresting depiction of a fascinating historical event.',
  'I was excited to see this film because I love historical dramas, but it was a huge disappointment. The storyline was disjointed, and it seemed like the director was trying too hard to be artsy. The acting was mediocre at best, and I found myself checking my watch multiple times throughout. Overall, a very underwhelming experience.'],
 ['A beautifully crafted masterpiece that captures the essence of childhood adventure.',
  "This movie is a true gem. The storyline was gripping from start to finish, filled with unexpected twists and turns. The performances were top-notch, with the lead actors delivering some of their career-best performances. The cinematography was beautiful, and the soundtrack perfectly complemented the mood of the film. It's a must-watch for anyone who appreciates quality cinema."],
 array([1, 3]),
 array([2, 4])]

## Cross Validation
* KFold, GroupKFold
* [Visualizing cross-validation behavior in scikit-learn](https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html)
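
As a rough sketch of how `KFold` and `GroupKFold` differ, the example below splits the same six samples both ways. The `groups` array is a hypothetical assignment (for instance, two samples per patient or per author) and is not taken from the original notebook.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

X_demo = np.arange(12).reshape(6, 2)
y_demo = np.array([0, 1, 0, 1, 0, 1])
groups = np.array([0, 0, 1, 1, 2, 2])  # hypothetical group ids, two samples each

# KFold ignores groups and cuts the data into consecutive blocks.
kfold_splits = list(KFold(n_splits=3).split(X_demo))

# GroupKFold keeps every group entirely on one side of each split.
group_splits = list(GroupKFold(n_splits=3).split(X_demo, y_demo, groups=groups))

for train_idx, test_idx in group_splits:
    shared = set(groups[train_idx]) & set(groups[test_idx])
    print("test:", test_idx, "groups shared with train:", shared)
```

`GroupKFold` is worth considering when samples from the same group are correlated, because a group that appears in both train and test can inflate the test score.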


In [8]:
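X = np.array([[1, 2],
              [3, 4],
              [1, 2],
              [3, 4]])
y = [0, 1, 0, 1]
# Swap in one of the alternative splitters below to compare their behavior.
# cv = KFold(n_splits=2)
# cv = LeaveOneOut()
# cv = GroupKFold(n_splits=2)  # also needs groups=... passed to cv.split
cv = StratifiedKFold(n_splits=2)
# Each tuple is (train_indices, test_indices) for one fold.
print(list(cv.split(X, y=y)))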

[(array([2, 3]), array([0, 1])), (array([0, 1]), array([2, 3]))]
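
The commented-out `LeaveOneOut` splitter in the cell above can be tried the same way. The sketch below reuses the same four-row array under the assumed name `X_demo`. It shows that `LeaveOneOut` yields one split per sample, with the same test folds as `KFold` when `n_splits` equals the number of samples.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_demo = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])

# One split per sample: each sample is the test set exactly once.
loo_splits = list(LeaveOneOut().split(X_demo))

# KFold with as many folds as samples gives the same partition.
kfold_n_splits = list(KFold(n_splits=len(X_demo)).split(X_demo))

for train_idx, test_idx in loo_splits:
    print("train:", train_idx, "test:", test_idx)
```

Leave-one-out uses nearly all the data for training in every fold, but it needs as many model fits as there are samples, so it is mostly practical for small datasets.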


<!-- ## Hyperparameter Tuning
Methods for Data Split
* Subsampling
* Stratified subsampling
* [Grid Search](https://scikit-learn.org/stable/modules/grid_search.html) -->