# Unit 3 Random Search in Machine Learning

## Lesson Introduction and Goals

Choosing the right parameters in machine learning models can greatly affect their success. Imagine these parameters as cake ingredients: the right amount makes your cake delicious. Similarly, the right parameter settings make your model accurate. Random Search helps find these “right ingredients” by trying random combinations. By the end of this lesson, you will:

  * Understand what Random Search is
  * Learn how to implement it using Scikit-Learn
  * Interpret the results to improve models

## What is Random Search?

Random Search tunes parameters by sampling a fixed number of combinations at random from the ranges you specify, like flipping through a recipe book and trying a few recipes instead of every single one. Grid Search, by contrast, evaluates every possible combination. Because Random Search evaluates only a sample, it usually finishes faster, at the cost of possibly missing the single best combination.
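To make the difference concrete, here is a minimal sketch in plain Python. The search space and the budget of 5 are illustrative assumptions, and nothing is trained here; the code only counts how many candidates each strategy would evaluate.

```python
import itertools
import random

# Illustrative search space (an assumption for this sketch)
C_values = [0.1, 0.5, 0.75, 1, 5, 10, 25, 50, 75, 100]
solvers = ["liblinear", "saga"]

# Grid Search: every combination (10 * 2 = 20 candidates)
grid_candidates = list(itertools.product(C_values, solvers))

# Random Search: a fixed budget of combinations drawn at random
rng = random.Random(42)
random_candidates = rng.sample(grid_candidates, k=5)

print(f"Grid Search would evaluate {len(grid_candidates)} combinations")
print(f"Random Search evaluates {len(random_candidates)} of them")
```

The budget (`k=5` here) plays the same role as `n_iter` in `RandomizedSearchCV`.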

## Loading and Preparing the Dataset

We’ll use the wine dataset from Scikit-Learn. Let's load it and scale features:

```python
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Load real dataset
X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)
```

To evaluate our model, we split the dataset into a training set (70%) and a testing set (30%).

```python
from sklearn.model_selection import train_test_split

# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```
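The lesson scales the full dataset before splitting, which keeps the demo short. A common alternative, sketched here as an assumption rather than as the lesson's method, is to split first and fit the scaler on the training rows only, so the test rows never influence the scaling statistics. The `_raw` and `_scaled` variable names are hypothetical and chosen to avoid clashing with the lesson's variables.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_raw, y = load_wine(return_X_y=True)

# Split first so the test rows never influence the scaler
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X_raw, y, test_size=0.3, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_raw)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test_raw)        # reuse those statistics on the test rows

print(X_train_scaled.shape, X_test_scaled.shape)
```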

## Defining the Parameter Distribution

A parameter distribution defines the values each parameter can take when Random Search samples combinations. For Logistic Regression, we’ll tune `C` and `solver`.

```python
# Defining the parameter distributions to sample from
param_distributions = {
    'C': [0.1, 0.5, 0.75, 1, 5, 10, 25, 50, 75, 100],
    'solver': ['liblinear', 'saga']
}
```

  * `C`: Controls the strength of regularization. Smaller values specify stronger regularization.
  * `solver`: Algorithm used in the optimization problem.
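To see what "sampling combinations" looks like in practice, the sketch below uses scikit-learn's `ParameterSampler` to draw a few combinations from the same dictionary. The choice of 3 draws is an arbitrary assumption, and no model is trained.

```python
from sklearn.model_selection import ParameterSampler

param_distributions = {
    'C': [0.1, 0.5, 0.75, 1, 5, 10, 25, 50, 75, 100],
    'solver': ['liblinear', 'saga']
}

# Draw 3 random combinations from the space defined above
sampled = list(ParameterSampler(param_distributions, n_iter=3, random_state=42))
for params in sampled:
    print(params)
```

Each printed dictionary is one candidate setting of the kind a randomized search would evaluate.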

## Performing Random Search

`RandomizedSearchCV` is a Scikit-Learn tool for Random Search. It randomly selects parameter combinations and evaluates their performance.

  * `n_iter`: Number of settings sampled.
  * `cv`: Number of cross-validation splits.

```python
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import LogisticRegression

# Performing randomized search
random_search = RandomizedSearchCV(LogisticRegression(max_iter=1000), param_distributions, n_iter=10, cv=5, random_state=42)
random_search.fit(X_train, y_train)
```

## Interpreting the Results

After running the search, find the best parameters and view the best score achieved during cross-validation.

```python
print(f"Best parameters: {random_search.best_params_}")
print(f"Best cross-validation score: {random_search.best_score_}")

# Best parameters: {'solver': 'liblinear', 'C': 5}
# Best cross-validation score: 0.992
```
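Beyond the single best result, the fitted search keeps a score for every candidate in `cv_results_`. The following sketch lists candidates from best to worst. To stay short it uses a smaller search space with only `C` and the default solver, which is an assumption and not the lesson's exact setup.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Smaller, illustrative search space (an assumption, not the lesson's dictionary)
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    {'C': [0.1, 1, 10, 100]},
    n_iter=4, cv=5, random_state=42,
)
search.fit(X_train, y_train)

results = search.cv_results_
# Walk through the candidates from best rank to worst
order = sorted(range(len(results['params'])), key=lambda i: results['rank_test_score'][i])
for i in order:
    print(results['rank_test_score'][i], results['params'][i],
          round(results['mean_test_score'][i], 3))
```

Looking at the whole ranking shows whether the best setting wins clearly or whether several settings perform about the same.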

## Calculating the Final Metric on the Testing Dataset

After identifying the best parameters from the Random Search, it’s crucial to evaluate the model on the testing dataset to see how well it generalizes to new, unseen data.

```python
from sklearn.metrics import accuracy_score

# Best model with best parameters from random search
best_model = random_search.best_estimator_

# Predicting on the testing set
y_pred = best_model.predict(X_test)

# Calculating the accuracy on the testing set
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {test_accuracy}")

# Test Accuracy: 0.981
```

In this example, the accuracy on the testing set is calculated using the best model obtained from `RandomizedSearchCV`. This final evaluation metric gives an indication of the model's performance on new data.

## Lesson Summary and Practice Introduction

In this lesson, you learned:

  * What Random Search is
  * How to load and split a dataset
  * How to define parameter ranges
  * How to implement Random Search with `RandomizedSearchCV`
  * How to interpret the best parameters and scores

Now it’s your turn to practice! Apply Random Search to different models and datasets. This will help solidify your understanding. Let’s move on to the practice session!

## Tuning Iterations in Random Search

Stellar Navigator, let's tweak Random Search! Change the number of iterations, `n_iter`, in `RandomizedSearchCV` from 10 to 5 to see how it affects the results. This small change shows how the iteration count influences hyperparameter tuning.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Load the "wine cake" recipe data
X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Split the data into "training batter" and "testing batter"
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the "ingredient amounts" for our logistic regression cake
param_distributions = {
    'C': [0.1, 0.5, 0.75, 1, 5, 10, 25, 50, 75, 100],
    'solver': ['liblinear', 'saga']
}

# Create and train our "random recipe picker"
# TODO: adjust number of iterations
random_search = RandomizedSearchCV(LogisticRegression(max_iter=1000), param_distributions, n_iter=10, cv=5, random_state=42)
random_search.fit(X_train, y_train)

# Display the "best recipe"
print(f"Best parameters: {random_search.best_params_}")
print(f"Best cross-validation score: {random_search.best_score_}")

```

The goal is to change `n_iter` in `RandomizedSearchCV` from 10 to 5 to see how that affects the results. Here is the adjusted code:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Load the "wine cake" recipe data
X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Split the data into "training batter" and "testing batter"
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the "ingredient amounts" for our logistic regression cake
param_distributions = {
    'C': [0.1, 0.5, 0.75, 1, 5, 10, 25, 50, 75, 100],
    'solver': ['liblinear', 'saga']
}

# Create and train our "random recipe picker"
# TODO: adjust number of iterations
random_search = RandomizedSearchCV(LogisticRegression(max_iter=1000), param_distributions, n_iter=5, cv=5, random_state=42)  # n_iter changed to 5
random_search.fit(X_train, y_train)

# Display the "best recipe"
print(f"Best parameters: {random_search.best_params_}")
print(f"Best cross-validation score: {random_search.best_score_}")
```

With `n_iter` set to 5, `RandomizedSearchCV` tries fewer random parameter combinations than before (10 iterations). The search runs faster, but it may miss the best combination because it makes fewer attempts. `n_iter` controls how many parameter settings are sampled at random.
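The trade-off can be checked directly by running the search with both budgets and counting the candidates. This is a minimal sketch: it tunes only `C` with the default solver, which is a simplifying assumption, and the `candidates_tried` dictionary is a hypothetical helper for the comparison.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Only C is tuned here to keep the sketch small (an assumption)
param_distributions = {'C': [0.1, 0.5, 0.75, 1, 5, 10, 25, 50, 75, 100]}

candidates_tried = {}
for n_iter in (5, 10):
    search = RandomizedSearchCV(
        LogisticRegression(max_iter=1000),
        param_distributions,
        n_iter=n_iter, cv=5, random_state=42,
    )
    search.fit(X_train, y_train)
    # One entry in cv_results_['params'] per sampled candidate
    candidates_tried[n_iter] = len(search.cv_results_['params'])
    print(f"n_iter={n_iter}: {candidates_tried[n_iter]} candidates, "
          f"{candidates_tried[n_iter] * 5} cross-validation fits, "
          f"best CV score {search.best_score_:.3f}")
```

Each candidate is fitted once per fold, so halving `n_iter` halves the number of cross-validation fits.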

## Fill in the Random Search for Best Parameters

You're doing great, Space Voyager! Now, let's fill in the missing pieces to make this code work. Complete the TODOs to perform a randomized search to find the best parameters for our Logistic Regression model.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define parameter grid
param_distributions = {
    'C': [0.1, 1, 10, 100],  
    'solver': ['liblinear', 'saga']
}

model = LogisticRegression(max_iter=5000)
# TODO: Initialize Random Search with required parameters
random_search = RandomizedSearchCV(_____)
random_search.fit(X_train, y_train)

# Print the best parameters found
print(f"Best parameters: {random_search.best_params_}")
print(f"Accuracy with the best parameters: {accuracy_score(y_test, random_search.best_estimator_.predict(X_test))}")

```

Let's complete the code to run a Randomized Search on the Logistic Regression model.

Initialize `RandomizedSearchCV` with the following parameters:

  * `estimator`: The model to tune, here `model` (that is, `LogisticRegression(max_iter=5000)`).
  * `param_distributions`: The dictionary of parameter distributions defined above, `param_distributions`.
  * `n_iter`: The number of parameter combinations to try (for example, 10). This search space has only 8 combinations, so scikit-learn caps the search at 8 and emits a warning.
  * `cv`: The number of cross-validation folds (for example, 5).
  * `random_state`: A seed that makes the results reproducible (for example, 42).

Here is the completed code:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler  # Added: this import was missing from the starter code

# Load dataset
X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define parameter grid
param_distributions = {
    'C': [0.1, 1, 10, 100],
    'solver': ['liblinear', 'saga']
}

model = LogisticRegression(max_iter=5000)
# TODO: Initialize Random Search with required parameters
random_search = RandomizedSearchCV(estimator=model, param_distributions=param_distributions, n_iter=10, cv=5, random_state=42)
random_search.fit(X_train, y_train)

# Print the best parameters found
print(f"Best parameters: {random_search.best_params_}")
print(f"Accuracy with the best parameters: {accuracy_score(y_test, random_search.best_estimator_.predict(X_test))}")
```

## Randomized Search for Logistic Regression Parameters

Great work so far; let's continue! Now, fill in the missing pieces to perform a Randomized Search for the best parameters of the LogisticRegression model and calculate the final metric on the testing set.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load and scale the wine dataset
X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the parameter grid
param_distributions = {
    'C': [0.1, 0.5, 0.75, 1, 5, 10, 25, 50, 75, 100],
    'solver': ['liblinear', 'saga']
}

# TODO: Perform Randomized Search on Logistic Regression model with given parameters
random_search = ____
random_search.fit(X_train, y_train)

# Print the best parameters found and the best score achieved
print(f"Best parameters: {random_search.best_params_}")
print(f"Best cross-validation score: {random_search.best_score_}")

# TODO: Calculate and print the accuracy on the test set
test_predictions = ____
test_accuracy = ____
print(f"Test set accuracy: {test_accuracy}")

```

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load and scale the wine dataset
X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the parameter grid
param_distributions = {
    'C': [0.1, 0.5, 0.75, 1, 5, 10, 25, 50, 75, 100],
    'solver': ['liblinear', 'saga']
}

# TODO: Perform Randomized Search on Logistic Regression model with given parameters
random_search = RandomizedSearchCV(LogisticRegression(max_iter=1000), param_distributions, n_iter=10, cv=5, random_state=42)
random_search.fit(X_train, y_train)

# Print the best parameters found and the best score achieved
print(f"Best parameters: {random_search.best_params_}")
print(f"Best cross-validation score: {random_search.best_score_}")

# TODO: Calculate and print the accuracy on the test set
test_predictions = random_search.best_estimator_.predict(X_test)
test_accuracy = accuracy_score(y_test, test_predictions)
print(f"Test set accuracy: {test_accuracy}")
```

## Tune the DecisionTree Regressor

Alright, Stellar Navigator, let's tweak our recipe for the best cake ingredients!

Add the missing pieces to tune the DecisionTreeRegressor using Random Search. Don't forget to use cross-validation to find the best combination!


```python
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Load dataset
X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define parameter grid for DecisionTreeRegressor
param_distributions = {
    'max_depth': [4, 6, 8, 10, 12],
    'min_samples_split': [2, 5, 10, 20]
}

# TODO: Perform randomized search
# TODO: Fit the random search model using the training data

print(f"Best parameters: {random_search.best_params_}")
print(f"Best cross-validation score: {random_search.best_score_}")

```

The error message indicates that `load_boston` was removed from scikit-learn in version 1.2 because of ethical concerns about the dataset. To continue, replace `load_boston` with one of the suggested alternatives, such as the California housing dataset or the Ames housing dataset.

The updated code below uses `fetch_california_housing` in place of `load_boston` and fills in the missing `RandomizedSearchCV` pieces:

```python
from sklearn.datasets import fetch_california_housing  # Replaces load_boston
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Load dataset
# X, y = load_boston(return_X_y=True)  # This line was removed
housing = fetch_california_housing()
X = housing.data
y = housing.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define parameter grid for DecisionTreeRegressor
param_distributions = {
    'max_depth': [4, 6, 8, 10, 12],
    'min_samples_split': [2, 5, 10, 20]
}

# Initialize the DecisionTreeRegressor model
dt_regressor = DecisionTreeRegressor(random_state=42)

# TODO: Perform randomized search
random_search = RandomizedSearchCV(estimator=dt_regressor,
                                   param_distributions=param_distributions,
                                   n_iter=10,  # Number of parameter combinations to sample at random
                                   cv=5,       # Number of cross-validation folds
                                   random_state=42,
                                   n_jobs=-1)  # Use all available cores for parallel computation

# TODO: Fit the random search model using the training data
random_search.fit(X_train, y_train)

print(f"Best parameters: {random_search.best_params_}")
print(f"Best cross-validation score: {random_search.best_score_}")
```
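For a regressor, `best_score_` is the R² score unless a different `scoring` is passed. The sketch below asks for mean squared error instead. It swaps in the bundled `load_diabetes` dataset so that it runs without a download; that dataset choice is an assumption made for this example, not the exercise's data.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Bundled regression dataset, used here only to avoid a network download
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_distributions = {
    'max_depth': [4, 6, 8, 10, 12],
    'min_samples_split': [2, 5, 10, 20]
}

search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_distributions,
    n_iter=10, cv=5, random_state=42,
    scoring='neg_mean_squared_error',  # scikit-learn maximizes scores, so MSE is negated
)
search.fit(X_train, y_train)

best_mse = -search.best_score_  # flip the sign back to a regular MSE
print(f"Best parameters: {search.best_params_}")
print(f"Best cross-validation MSE: {best_mse:.1f}")
```

Because scikit-learn always treats higher scores as better, error metrics appear with a `neg_` prefix and must be negated before reporting.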

## Implement Model Competition

Great job, Space Explorer! Let's put your new skills to the test. Fill in the missing code to perform RandomizedSearchCV on LogisticRegression and DecisionTree models. The code should create RandomizedSearchCV instances using the provided parameter distributions and fit them to the training data. Implement the random recipe picker and see which one turns out best.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler

# Step 1: Getting the dataset
X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Step 2: Splitting into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Defining parameter distributions
param_distributions_lr = {'C': [0.1, 1, 10], 'solver': ['liblinear']}
param_distributions_dt = {'max_depth': [3, 5, 10], 'min_samples_split': [2, 5, 10]}

# Step 4: Performing Random Search
# TODO: Add RandomizedSearchCV for Logistic Regression with given parameter distributions and 2000 max iterations
# TODO: Add RandomizedSearchCV for Decision Tree with given parameter distributions

# Step 5: Fitting the models
random_search_lr.fit(X_train, y_train)
random_search_dt.fit(X_train, y_train)

# Step 6: Displaying the best parameters and scores
print(f"Best Logistic Regression parameters: {random_search_lr.best_params_}")
print(f"Best Logistic Regression score: {random_search_lr.best_score_}")
print(f"Best Decision Tree parameters: {random_search_dt.best_params_}")
print(f"Best Decision Tree score: {random_search_dt.best_score_}")

```

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler

# Step 1: Getting the dataset
X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Step 2: Splitting into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Defining parameter distributions
param_distributions_lr = {'C': [0.1, 1, 10], 'solver': ['liblinear']}
param_distributions_dt = {'max_depth': [3, 5, 10], 'min_samples_split': [2, 5, 10]}

# Step 4: Performing Random Search
# TODO: Add RandomizedSearchCV for Logistic Regression with given parameter distributions and 2000 max iterations
random_search_lr = RandomizedSearchCV(LogisticRegression(max_iter=2000), param_distributions_lr, n_iter=len(param_distributions_lr['C']) * len(param_distributions_lr['solver']), cv=5, random_state=42)

# TODO: Add RandomizedSearchCV for Decision Tree with given parameter distributions
random_search_dt = RandomizedSearchCV(DecisionTreeClassifier(random_state=42), param_distributions_dt, n_iter=len(param_distributions_dt['max_depth']) * len(param_distributions_dt['min_samples_split']), cv=5, random_state=42)

# Step 5: Fitting the models
random_search_lr.fit(X_train, y_train)
random_search_dt.fit(X_train, y_train)

# Step 6: Displaying the best parameters and scores
print(f"Best Logistic Regression parameters: {random_search_lr.best_params_}")
print(f"Best Logistic Regression score: {random_search_lr.best_score_}")
print(f"Best Decision Tree parameters: {random_search_dt.best_params_}")
print(f"Best Decision Tree score: {random_search_dt.best_score_}")
```
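To finish the competition, one option is to pick the search with the higher cross-validation score and only then check it on the test set. The sketch below does that under a few assumptions: Logistic Regression uses its default solver and tunes only `C`, and the `searches` dictionary and `winner` variable are hypothetical names introduced for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# One randomized search per competing model
searches = {
    'Logistic Regression': RandomizedSearchCV(
        LogisticRegression(max_iter=2000),
        {'C': [0.1, 1, 10]},
        n_iter=3, cv=5, random_state=42,
    ),
    'Decision Tree': RandomizedSearchCV(
        DecisionTreeClassifier(random_state=42),
        {'max_depth': [3, 5, 10], 'min_samples_split': [2, 5, 10]},
        n_iter=9, cv=5, random_state=42,
    ),
}
for search in searches.values():
    search.fit(X_train, y_train)

# Choose the winner by cross-validation score, then evaluate it once on the test set
winner = max(searches, key=lambda name: searches[name].best_score_)
test_accuracy = searches[winner].score(X_test, y_test)
print(f"Winner: {winner} with parameters {searches[winner].best_params_}")
print(f"Test accuracy of the winner: {test_accuracy:.3f}")
```

Selecting on the cross-validation score and touching the test set only once keeps the final accuracy an honest estimate.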