# Unit 4 Train-Test Split

### Lesson Introduction

Imagine you built a robot to recognize apples and oranges. But how do you know if it's good at this task? You need to test it on some new apples and oranges it hasn’t seen before. In machine learning, we do something similar by splitting our data into training and test sets. This helps us see how well our model performs on new data.

Splitting the data also helps us detect **overfitting**, so we can take steps to prevent it. Overfitting is when a machine learning model learns the training data too well, including noise and details that don’t apply to new data. The result is excellent performance on the training set but poor performance on the test set, a sign that the model has memorized specifics rather than learned general patterns.
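To make the idea concrete, here is a minimal sketch of what overfitting can look like in practice. The synthetic noisy dataset and the decision tree classifier are assumptions chosen only for illustration; they are not part of this lesson's fruit example. The `train_test_split` function used here is covered later in this lesson.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data (an assumption for illustration only)
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, flip_y=0.3, random_state=0
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unconstrained decision tree can memorize the training set, noise included
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print("Training accuracy:", train_acc)
print("Test accuracy:", test_acc)
```

A large gap between the two numbers, with the training accuracy much higher, is the typical symptom of overfitting.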

Today, we will learn how to split a dataset into training and test sets using the `train_test_split` function from Scikit-learn. By the end of this lesson, you'll know how to prepare your data properly to evaluate your model.

-----

### What is a Train-Test Split

A train-test split divides the dataset into two parts: one to train the model and one to test it. The training set helps the model learn patterns, and the test set helps us check whether the model is good at predicting new data.

For example, if you have 10 pictures of fruits, you might use 8 to train your robot and 2 to test it. This lets you check that the robot hasn’t just memorized the training pictures but can recognize new ones too.
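The 8-and-2 idea can be sketched by hand in plain Python, before reaching for any library. The file names below are hypothetical placeholders, and this is only an illustration of the concept, not how Scikit-learn implements it.

```python
import random

# Hypothetical picture names standing in for the 10 fruit pictures
pictures = [f"fruit_{i}.png" for i in range(10)]

# Shuffle a copy so the split does not depend on the original order
rng = random.Random(0)
shuffled = pictures[:]
rng.shuffle(shuffled)

train_pictures = shuffled[:8]  # 8 pictures to learn from
test_pictures = shuffled[8:]   # 2 pictures held back for testing

print(len(train_pictures), len(test_pictures))  # 8 2
```

The key property is that no picture appears in both groups, so the test pictures are truly new to the robot.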

-----

### The `train_test_split` Function

To split the data, we use the `train_test_split` function from the Scikit-learn library. This function makes it easy to divide your data randomly. Let’s first see how to import what we need:

```python
from sklearn.model_selection import train_test_split
```

-----

### Small Dataset

Let's use a very small dataset. Imagine we have 10 fruit images, each described by a single numeric feature, along with their labels (like apple or orange). Here is our dataset:

```python
# Small dataset
X = [[0.1], [0.2], [0.1], [0.5], [0.5], [0.2], [0.2], [0.4], [0.1], [0.2]]  # 10 samples, one feature each
y = [0, 0, 0, 1, 1, 0, 0, 1, 0, 0]  # 10 target labels
```

In this example, `X` holds the features (one numeric value per fruit image), and `y` holds the target labels (0 for 'apple' and 1 for 'orange').
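Before splitting, it can help to run a quick sanity check on the data. The sketch below uses `collections.Counter` from the standard library (a choice made here for illustration) to confirm that every sample has a label and to count how many apples and oranges there are.

```python
from collections import Counter

X = [[0.1], [0.2], [0.1], [0.5], [0.5], [0.2], [0.2], [0.4], [0.1], [0.2]]
y = [0, 0, 0, 1, 1, 0, 0, 1, 0, 0]

# Every sample in X needs exactly one label in y
assert len(X) == len(y)

# Count how often each label occurs
label_counts = Counter(y)
print(label_counts[0], label_counts[1])  # 7 3
```

Here there are 7 apples and only 3 oranges, which is worth knowing because a very small test set may end up with few or no oranges.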

-----

### Splitting the Dataset

Now, let's use the `train_test_split` function to divide our dataset into training and test sets:

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_train))  # 8
print(len(X_test))  # 2
print(len(y_train))  # 8
print(len(y_test))  # 2
```

Here’s what this does:

  * `X_train` and `y_train` are the training sets.
  * `X_test` and `y_test` are the test sets.
  * `test_size=0.2` means 20% of the data is for testing, and the remaining 80% is for training. It is common to use 20-30% of your data for the test set.
  * `random_state=42` ensures the split is the same every time you run the code, which is handy for reproducibility. You can use any integer for `random_state`; 42 is just an arbitrary choice (or a reference to some book 😉).
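The reproducibility point can be checked directly. This minimal sketch calls `train_test_split` twice with the same `random_state` on the lesson's small dataset and compares the results; the variable names `first`, `second`, and `same_split` are introduced here only for illustration.

```python
from sklearn.model_selection import train_test_split

X = [[0.1], [0.2], [0.1], [0.5], [0.5], [0.2], [0.2], [0.4], [0.1], [0.2]]
y = [0, 0, 0, 1, 1, 0, 0, 1, 0, 0]

# Same inputs and same random_state: the two splits should be identical
first = train_test_split(X, y, test_size=0.2, random_state=42)
second = train_test_split(X, y, test_size=0.2, random_state=42)

same_split = first == second
print(same_split)  # True
```

If you leave out `random_state`, the data is shuffled differently on each run, so the two splits would usually not match.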

-----

### Lesson Summary

In this lesson, we learned why it's important to split our data into training and test sets. We discussed overfitting, which is like memorizing homework answers but failing the test. We then explored the `train_test_split` function, used a small dataset, and split it into training and test sets. Finally, we checked the sizes of our splits to ensure everything was correctly set up.

Great job! Now, it’s time to practice what you’ve learned. You will get hands-on experience applying the train-test split to different datasets, ensuring you’re ready to evaluate your models correctly. Remember, practice is key to mastering machine learning concepts!

## Adjusting Test Set Size

Space Explorer, let's tweak the train-test split. Change `test_size` from 0.3 to 0.5 and see how that changes the sizes of the training and test sets.


```python
from sklearn.model_selection import train_test_split

# Fruit dataset
X = [[0.3], [0.4], [0.4], [0.7], [0.5], [0.3], [0.8], [0.6], [0.4], [0.2]]
y = [1, 1, 0, 1, 0, 0, 1, 1, 0, 0]  # 1 for orange, 0 for apple

# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

print("Training set size:", len(X_train))
print("Test set size:", len(X_test))

```

```python
from sklearn.model_selection import train_test_split

# Fruit dataset
X = [[0.3], [0.4], [0.4], [0.7], [0.5], [0.3], [0.8], [0.6], [0.4], [0.2]]
y = [1, 1, 0, 1, 0, 0, 1, 1, 0, 0]  # 1 for orange, 0 for apple

# Splitting the dataset with test_size changed to 0.5
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

print("Training set size:", len(X_train))
print("Test set size:", len(X_test))
```
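To see the effect of `test_size` more broadly, the following sketch tries several values on the same fruit dataset and records the resulting sizes. The `sizes` dictionary is a helper introduced here for illustration.

```python
from sklearn.model_selection import train_test_split

# Fruit dataset from the exercise above
X = [[0.3], [0.4], [0.4], [0.7], [0.5], [0.3], [0.8], [0.6], [0.4], [0.2]]
y = [1, 1, 0, 1, 0, 0, 1, 1, 0, 0]  # 1 for orange, 0 for apple

sizes = {}
for test_size in (0.2, 0.3, 0.5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=1
    )
    # Store (training size, test size) for this fraction
    sizes[test_size] = (len(X_train), len(X_test))
    print(test_size, "->", len(X_train), "train /", len(X_test), "test")
```

With 10 samples, this gives 8/2, 7/3, and 5/5. A larger test set gives a more reliable evaluation but leaves less data for training.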

## Debug the Fruit Dataset Split

Hey there, stellar navigator!

Given the code for fruit classification, there's a small bug preventing it from working correctly. Your task is to identify and fix the bug. Once fixed, the code will correctly split the dataset into training and test sets.

Good luck!


```python
from sklearn.model_selection import train_test_split

# Dataset representing features of some fruit images and their labels
fruits = [[0.1], [0.3], [0.2], [0.6], [0.5], [0.4], [0.3], [0.2], [0.4], [0.3]]
labels = [0, 1, 0, 1, 1, 0, 0, 1, 1, 0]  # 0 for apple, 1 for orange

# Splitting the dataset into training and test sets
fruits_train, labels_train, fruits_test, labels_test = train_test_split(fruits, labels, test_size=0.3, random_state=7)

print("Training features:", fruits_train)
print("Test features:", fruits_test)
```

Hello Stellar Navigator!

You have found a small bug. The `train_test_split` function from scikit-learn returns four values, grouped as two pairs (`X_train, X_test` and then `y_train, y_test`), so both feature sets come first, followed by both label sets.

**Bug:**
The variables receiving the output of `train_test_split` are listed in the wrong order: `labels_train` is assigned the test features, and `fruits_test` is assigned the training labels.

**Fix:**
We need to reorder the variables that receive the output of `train_test_split`. The correct order is `fruits_train, fruits_test, labels_train, labels_test`.

Here is the corrected code:

```python
from sklearn.model_selection import train_test_split

# Dataset representing features of some fruit images and their labels
fruits = [[0.1], [0.3], [0.2], [0.6], [0.5], [0.4], [0.3], [0.2], [0.4], [0.3]]
labels = [0, 1, 0, 1, 1, 0, 0, 1, 1, 0]  # 0 for apple, 1 for orange

# Splitting the dataset into training and test sets
# Bug fix: Correctly unpacking the four return values from train_test_split
fruits_train, fruits_test, labels_train, labels_test = train_test_split(fruits, labels, test_size=0.3, random_state=7)

print("Training features:", fruits_train)
print("Test features:", fruits_test)
print("Training labels:", labels_train)  # Added for completeness
print("Test labels:", labels_test)  # Added for completeness
```

**Explanation of the fix:**

For one feature array and one label array, `train_test_split` always returns its outputs in this order:

1.  Training features
2.  Test features
3.  Training labels
4.  Test labels

By changing the order of the receiving variables to `fruits_train, fruits_test, labels_train, labels_test`, we make sure each value returned by the function is assigned to the correct variable, so the dataset split works as intended.
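One way to convince yourself that the return order is right is to split a third list of row numbers alongside the features and labels. `train_test_split` accepts any number of equally long sequences and returns a train/test pair for each, in the order they were passed. The `positions` list below is a hypothetical helper added only for this check.

```python
from sklearn.model_selection import train_test_split

fruits = [[0.1], [0.3], [0.2], [0.6], [0.5], [0.4], [0.3], [0.2], [0.4], [0.3]]
labels = [0, 1, 0, 1, 1, 0, 0, 1, 1, 0]  # 0 for apple, 1 for orange
positions = list(range(len(fruits)))  # hypothetical helper: original row numbers

# Three inputs produce six outputs: a (train, test) pair per input, in input order
(fruits_train, fruits_test,
 labels_train, labels_test,
 pos_train, pos_test) = train_test_split(
    fruits, labels, positions, test_size=0.3, random_state=7
)

# Each split row should still match its original feature and label
train_aligned = all(
    fruits[p] == f and labels[p] == l
    for p, f, l in zip(pos_train, fruits_train, labels_train)
)
test_aligned = all(
    fruits[p] == f and labels[p] == l
    for p, f, l in zip(pos_test, fruits_test, labels_test)
)
print(train_aligned, test_aligned)  # True True
```

If the variables were unpacked in the buggy order, these checks would compare features against labels and fail.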

## Splitting the Fruit Dataset

Hey, Space Explorer, let's test your skills!

Fill in the missing code to split the dataset into training and test sets, and print their lengths. Let's see how well you can handle this!


```python
from sklearn.model_selection import train_test_split

# Small dataset for fruit classification
X = [[0.1], [0.3], [0.1], [0.6], [0.5], [0.2], [0.3], [0.4], [0.2], [0.5]]  # 10 features
y = [0, 1, 0, 1, 1, 0, 1, 1, 0, 1]  # 10 target labels, 0 for 'apple', 1 for 'orange'

# TODO: Split dataset into training and test sets

# Output the lengths of the splits to verify
print(len(X_train), len(X_test))   # Should print 7 3
print(len(y_train), len(y_test))   # Should print 7 3
```

```python
from sklearn.model_selection import train_test_split

# Small dataset for fruit classification
X = [[0.1], [0.3], [0.1], [0.6], [0.5], [0.2], [0.3], [0.4], [0.2], [0.5]]  # 10 features
y = [0, 1, 0, 1, 1, 0, 1, 1, 0, 1]  # 10 target labels, 0 for 'apple', 1 for 'orange'

# Split dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Output the lengths of the splits to verify
print(len(X_train), len(X_test))
print(len(y_train), len(y_test))
```

## Splitting the Iris Dataset

Cosmic effort so far, data voyager! Now, let's get our hands dirty.

Split the iris dataset into training and testing sets using a 30% test size. Complete the missing parts below.

May the stars guide your coding journey!

```python
from sklearn.model_selection import train_test_split

# Small dataset for fruit classification
X = [[0.1], [0.3], [0.1], [0.6], [0.5], [0.2], [0.3], [0.4], [0.2], [0.5]]  # 10 features
y = [0, 1, 0, 1, 1, 0, 1, 1, 0, 1]  # 10 target labels, 0 for 'apple', 1 for 'orange'

# Split dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Output the lengths of the splits to verify
print(len(X_train), len(X_test))
print(len(y_train), len(y_test))

```

It seems there might be a slight mix-up, data voyager! The code you provided and are asking to complete is for a "fruit classification" dataset, not the Iris dataset.

To split the **Iris dataset**, we first need to load it. The Iris dataset is a classic and is often available directly from `sklearn.datasets`. Then we can apply the `train_test_split` function.

Here's the completed code to split the Iris dataset with a 30% test size:

```python
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
X = iris.data  # Features
y = iris.target # Target labels

# Split dataset into training and test sets with 30% test size
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Output the lengths of the splits to verify
print("Lengths of Iris dataset splits:")
print(f"X_train length: {len(X_train)}, X_test length: {len(X_test)}")
print(f"y_train length: {len(y_train)}, y_test length: {len(y_test)}")
```
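As a follow-up check, the sketch below looks at the shapes of the Iris splits rather than just their lengths. It uses `load_iris(return_X_y=True)`, a shortcut that returns the features and labels directly. The shape shows both the number of samples and the number of features per sample.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load features and labels directly as two arrays
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X.shape)        # (150, 4)
print(X_train.shape)  # (105, 4)
print(X_test.shape)   # (45, 4)
```

The Iris dataset has 150 samples with 4 features each, so a 30% test size leaves 105 samples for training and 45 for testing.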