# Activity: Build a random forest model

## **Introduction**


As you're learning, random forests are popular statistical learning algorithms. Some of their primary benefits include reducing variance, bias, and the chance of overfitting.

This activity is a continuation of the project you began modeling with decision trees for an airline. Here, you will train, tune, and evaluate a random forest model using data from a spreadsheet of survey responses from 129,880 customers. It includes data points such as class, flight distance, and inflight entertainment. Your random forest model will be used to predict whether a customer will be satisfied with their flight experience.

**Note:** Because this lab uses a real dataset, this notebook first requires exploratory data analysis, data cleaning, and other manipulations to prepare it for modeling.

## **Step 1: Imports** 


Import relevant Python libraries and modules, including the `numpy` and `pandas` libraries for data processing; the `pickle` package to save the model; and the `sklearn` library, containing:
- The module `ensemble`, which has the class `RandomForestClassifier`
- The module `model_selection`, which has the function `train_test_split` and the classes `PredefinedSplit` and `GridSearchCV`
- The module `metrics`, which has the functions `f1_score`, `precision_score`, `recall_score`, and `accuracy_score`


In [1]:
# Data processing
import numpy as np
import pandas as pd
import pickle

# Model building
from sklearn.ensemble import RandomForestClassifier

# Model selection and evaluation
from sklearn.model_selection import train_test_split, GridSearchCV, PredefinedSplit
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score

As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

In [2]:
# RUN THIS CELL TO IMPORT YOUR DATA. 

### YOUR CODE HERE ###

air_data = pd.read_csv("Invistico_Airline.csv")

<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

The `read_csv()` function from the `pandas` library can be helpful here.
 
</details>

Now, you're ready to begin cleaning your data. 

## **Step 2: Data cleaning** 

To get a sense of the data, display the first 10 rows.

In [3]:
# Display the first 10 rows of the dataset
air_data.head(10)

Unnamed: 0,satisfaction,Customer Type,Age,Type of Travel,Class,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,...,Online support,Ease of Online booking,On-board service,Leg room service,Baggage handling,Checkin service,Cleanliness,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes
0,satisfied,Loyal Customer,65,Personal Travel,Eco,265,0,0,0,2,...,2,3,3,0,3,5,3,2,0,0.0
1,satisfied,Loyal Customer,47,Personal Travel,Business,2464,0,0,0,3,...,2,3,4,4,4,2,3,2,310,305.0
2,satisfied,Loyal Customer,15,Personal Travel,Eco,2138,0,0,0,3,...,2,2,3,3,4,4,4,2,0,0.0
3,satisfied,Loyal Customer,60,Personal Travel,Eco,623,0,0,0,3,...,3,1,1,0,1,4,1,3,0,0.0
4,satisfied,Loyal Customer,70,Personal Travel,Eco,354,0,0,0,3,...,4,2,2,0,2,4,2,5,0,0.0
5,satisfied,Loyal Customer,30,Personal Travel,Eco,1894,0,0,0,3,...,2,2,5,4,5,5,4,2,0,0.0
6,satisfied,Loyal Customer,66,Personal Travel,Eco,227,0,0,0,3,...,5,5,5,0,5,5,5,3,17,15.0
7,satisfied,Loyal Customer,10,Personal Travel,Eco,1812,0,0,0,3,...,2,2,3,3,4,5,4,2,0,0.0
8,satisfied,Loyal Customer,56,Personal Travel,Business,73,0,0,0,3,...,5,4,4,0,1,5,4,4,0,0.0
9,satisfied,Loyal Customer,22,Personal Travel,Eco,1556,0,0,0,3,...,2,2,2,4,5,3,4,2,30,26.0


<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

The `head()` function from the `pandas` library can be helpful here.
 
</details>

Now, display the variable names and their data types. 

In [4]:
# Display variable names and their data types
print(air_data.dtypes)

satisfaction                          object
Customer Type                         object
Age                                    int64
Type of Travel                        object
Class                                 object
Flight Distance                        int64
Seat comfort                           int64
Departure/Arrival time convenient      int64
Food and drink                         int64
Gate location                          int64
Inflight wifi service                  int64
Inflight entertainment                 int64
Online support                         int64
Ease of Online booking                 int64
On-board service                       int64
Leg room service                       int64
Baggage handling                       int64
Checkin service                        int64
Cleanliness                            int64
Online boarding                        int64
Departure Delay in Minutes             int64
Arrival Delay in Minutes             float64
dtype: object

<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

DataFrames have an attribute that outputs variable names and data types in one result.
 
</details>

**Question:** What do you observe about the differences in data types among the variables included in the data?

In the dataset, we observe two main types of variables:

Categorical variables (object type): These include satisfaction, Customer Type, Type of Travel, and Class. These are non-numeric and represent categories or labels. They will need to be encoded (e.g., using one-hot encoding or label encoding) before they can be used in a machine learning model.

Numerical variables (int64 and float64): The rest of the variables are numerical, representing ratings (e.g., Seat comfort, Cleanliness), distances (e.g., Flight Distance), delays, or age. These are already in a suitable format for modeling but may require scaling or normalization depending on the model being used.

Key Takeaway:
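Categorical features must be converted to numerical form. The `float64` dtype of `Arrival Delay in Minutes`, next to the `int64` dtype of `Departure Delay in Minutes`, suggests that the column may contain missing values (NaN forces a float dtype in pandas), which should be handled before modeling.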
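The grouping described above can also be done programmatically. The following is a minimal sketch on a tiny made-up DataFrame (`toy` and its values are illustrative assumptions, not rows from the airline data) that uses `select_dtypes` to list the columns that need encoding and the columns that are already numeric.

```python
import pandas as pd

# Toy frame with made-up values that mimics the mix of dtypes in the airline data
toy = pd.DataFrame({
    "satisfaction": ["satisfied", "dissatisfied", "satisfied"],
    "Class": ["Eco", "Business", "Eco Plus"],
    "Age": [65, 47, 15],
    "Arrival Delay in Minutes": [0.0, 305.0, None],
})

# Non-numeric columns are the ones that need encoding before modeling
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Integer and float columns are already numeric
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

print("Categorical:", categorical_cols)
print("Numeric:", numeric_cols)
```

The same two calls could be applied to `air_data` to confirm which columns fall into each group.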

Next, to understand the size of the dataset, identify the number of rows and the number of columns.

In [5]:
# Identify the number of rows and the number of columns
print(air_data.shape)

(129880, 22)


<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

DataFrames have an attribute that outputs the number of rows and the number of columns in one result.

</details>

Now, check for missing values in the rows of the data. Start with .isna() to get Booleans indicating whether each value in the data is missing. Then, use .any(axis=1) to get Booleans indicating whether there are any missing values along the columns in each row. Finally, use .sum() to get the number of rows that contain missing values.

In [6]:
# Step 1: Get Booleans to find missing values in data
missing_values = air_data.isna()

# Step 2: Get Booleans to find missing values along columns (across each row)
rows_with_missing_values = missing_values.any(axis=1)

# Step 3: Get the number of rows that contain missing values
num_rows_with_missing_values = rows_with_missing_values.sum()

# Print the result
print(f"Number of rows with missing values: {num_rows_with_missing_values}")

Number of rows with missing values: 393


**Question:** How many rows of data are missing values?

The dataset has 393 rows that contain missing values.

You can address these missing values using techniques such as imputation (filling in missing values with the mean, median, or mode), or by removing rows with missing values, depending on your modeling strategy.
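To make the two options concrete, here is a small sketch on made-up delay values (the `toy` DataFrame is an assumption for illustration, not the airline data). It counts incomplete rows, then compares dropping them with filling them using the column median.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing arrival delay (values are made up)
toy = pd.DataFrame({
    "Departure Delay in Minutes": [0, 310, 17, 30],
    "Arrival Delay in Minutes": [0.0, np.nan, 15.0, 26.0],
})

# Count rows that contain at least one missing value
n_missing_rows = toy.isna().any(axis=1).sum()

# Option 1: drop the incomplete rows
dropped = toy.dropna(axis=0)

# Option 2: keep every row and fill gaps with each column's median
imputed = toy.fillna(toy.median(numeric_only=True))

print(n_missing_rows, dropped.shape, imputed.shape)
```

Dropping is simple and loses little when only a small share of rows is affected; imputation keeps every row at the cost of introducing estimated values.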

Drop the rows with missing values. This is an important step in data cleaning, as it makes the data more useful for analysis and modeling. Then, save the resulting pandas DataFrame in a variable named `air_data_subset`.

In [7]:
# Drop rows with missing values
air_data_subset = air_data.dropna(axis=0)

# Display the shape of the new DataFrame to confirm
print(air_data_subset.shape)

(129487, 22)


<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

The `dropna()` function is helpful here.
</details>

<details>
<summary><h4><strong>Hint 2</strong></h4></summary>

The axis parameter passed in to this function should be set to 0 (if you want to drop rows containing missing values) or 1 (if you want to drop columns containing missing values).
</details>

Next, display the first 10 rows to examine the data subset.

In [8]:
# Display the first 10 rows of the cleaned data
air_data_subset.head(10)

Unnamed: 0,satisfaction,Customer Type,Age,Type of Travel,Class,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,...,Online support,Ease of Online booking,On-board service,Leg room service,Baggage handling,Checkin service,Cleanliness,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes
0,satisfied,Loyal Customer,65,Personal Travel,Eco,265,0,0,0,2,...,2,3,3,0,3,5,3,2,0,0.0
1,satisfied,Loyal Customer,47,Personal Travel,Business,2464,0,0,0,3,...,2,3,4,4,4,2,3,2,310,305.0
2,satisfied,Loyal Customer,15,Personal Travel,Eco,2138,0,0,0,3,...,2,2,3,3,4,4,4,2,0,0.0
3,satisfied,Loyal Customer,60,Personal Travel,Eco,623,0,0,0,3,...,3,1,1,0,1,4,1,3,0,0.0
4,satisfied,Loyal Customer,70,Personal Travel,Eco,354,0,0,0,3,...,4,2,2,0,2,4,2,5,0,0.0
5,satisfied,Loyal Customer,30,Personal Travel,Eco,1894,0,0,0,3,...,2,2,5,4,5,5,4,2,0,0.0
6,satisfied,Loyal Customer,66,Personal Travel,Eco,227,0,0,0,3,...,5,5,5,0,5,5,5,3,17,15.0
7,satisfied,Loyal Customer,10,Personal Travel,Eco,1812,0,0,0,3,...,2,2,3,3,4,5,4,2,0,0.0
8,satisfied,Loyal Customer,56,Personal Travel,Business,73,0,0,0,3,...,5,4,4,0,1,5,4,4,0,0.0
9,satisfied,Loyal Customer,22,Personal Travel,Eco,1556,0,0,0,3,...,2,2,2,4,5,3,4,2,30,26.0


Confirm that it does not contain any missing values.

In [9]:
# Count of missing values in the dataset
missing_values = air_data_subset.isna().sum()

# Display the count of missing values per column
print(missing_values)

satisfaction                         0
Customer Type                        0
Age                                  0
Type of Travel                       0
Class                                0
Flight Distance                      0
Seat comfort                         0
Departure/Arrival time convenient    0
Food and drink                       0
Gate location                        0
Inflight wifi service                0
Inflight entertainment               0
Online support                       0
Ease of Online booking               0
On-board service                     0
Leg room service                     0
Baggage handling                     0
Checkin service                      0
Cleanliness                          0
Online boarding                      0
Departure Delay in Minutes           0
Arrival Delay in Minutes             0
dtype: int64


<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

You can use `.isna().sum()` to get the number of missing values for each variable.

</details>

Next, convert the categorical features to indicator (one-hot encoded) features. 

**Note:** The `drop_first` argument can be kept as default (`False`) during one-hot encoding for random forest models, so it does not need to be specified. Also, the target variable, `satisfaction`, does not need to be encoded and will be extracted in a later step.

In [10]:
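# Convert categorical features to one-hot encoded features
# Note: this call also encodes the target `satisfaction`; Step 3 uses the
# resulting `satisfaction_satisfied` column as the label
air_data_subset_encoded = pd.get_dummies(air_data_subset, drop_first=False)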

# Display the first 10 rows of the one-hot encoded dataset
air_data_subset_encoded.head(10)

Unnamed: 0,Age,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,Inflight wifi service,Inflight entertainment,Online support,Ease of Online booking,...,Arrival Delay in Minutes,satisfaction_dissatisfied,satisfaction_satisfied,Customer Type_Loyal Customer,Customer Type_disloyal Customer,Type of Travel_Business travel,Type of Travel_Personal Travel,Class_Business,Class_Eco,Class_Eco Plus
0,65,265,0,0,0,2,2,4,2,3,...,0.0,0,1,1,0,0,1,0,1,0
1,47,2464,0,0,0,3,0,2,2,3,...,305.0,0,1,1,0,0,1,1,0,0
2,15,2138,0,0,0,3,2,0,2,2,...,0.0,0,1,1,0,0,1,0,1,0
3,60,623,0,0,0,3,3,4,3,1,...,0.0,0,1,1,0,0,1,0,1,0
4,70,354,0,0,0,3,4,3,4,2,...,0.0,0,1,1,0,0,1,0,1,0
5,30,1894,0,0,0,3,2,0,2,2,...,0.0,0,1,1,0,0,1,0,1,0
6,66,227,0,0,0,3,2,5,5,5,...,15.0,0,1,1,0,0,1,0,1,0
7,10,1812,0,0,0,3,2,0,2,2,...,0.0,0,1,1,0,0,1,0,1,0
8,56,73,0,0,0,3,5,3,5,4,...,0.0,0,1,1,0,0,1,1,0,0
9,22,1556,0,0,0,3,2,0,2,2,...,26.0,0,1,1,0,0,1,0,1,0


<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

You can use the `pd.get_dummies()` function to convert categorical variables to one-hot encoded variables.
</details>

**Question:** Why is it necessary to convert categorical data into dummy variables?

It is necessary to convert categorical data into dummy variables because scikit-learn estimators, including `RandomForestClassifier`, expect numerical input and cannot work with strings in their original form. One-hot encoding converts each categorical feature into binary (0 or 1) dummy variables, each representing the presence or absence of a category.

By creating dummy variables, each category in a feature is represented by its own column, where:

A "1" indicates the presence of that category.

A "0" indicates the absence of that category.

This transformation allows algorithms to interpret categorical data as numerical information and capture relationships between different categories. For example, the variable "Class" (with categories like "Eco," "Business," and "Eco Plus") is converted into separate columns for each class type, allowing the model to understand these as distinct values.

In summary, dummy variables make categorical features interpretable for machine learning models and help to ensure that the model can learn from them effectively.
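As a small illustration of the transformation described above, the sketch below encodes a made-up `Class` column (the `toy` DataFrame is an assumption for illustration). Passing `dtype=int` requests 0/1 integers; without it, recent pandas versions return `True`/`False` instead.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column (values are made up)
toy = pd.DataFrame({
    "Class": ["Eco", "Business", "Eco Plus", "Eco"],
    "Age": [65, 47, 15, 60],
})

# One-hot encode only the categorical column; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["Class"], dtype=int)

print(encoded)
```

Each row has exactly one `1` across the three `Class_` columns, which is what lets the model treat the categories as distinct values.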

Next, display the first 10 rows to review the `air_data_subset_encoded`. 

In [11]:
# Display the first 10 rows of the one-hot encoded dataset
air_data_subset_encoded.head(10)

Unnamed: 0,Age,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,Inflight wifi service,Inflight entertainment,Online support,Ease of Online booking,...,Arrival Delay in Minutes,satisfaction_dissatisfied,satisfaction_satisfied,Customer Type_Loyal Customer,Customer Type_disloyal Customer,Type of Travel_Business travel,Type of Travel_Personal Travel,Class_Business,Class_Eco,Class_Eco Plus
0,65,265,0,0,0,2,2,4,2,3,...,0.0,0,1,1,0,0,1,0,1,0
1,47,2464,0,0,0,3,0,2,2,3,...,305.0,0,1,1,0,0,1,1,0,0
2,15,2138,0,0,0,3,2,0,2,2,...,0.0,0,1,1,0,0,1,0,1,0
3,60,623,0,0,0,3,3,4,3,1,...,0.0,0,1,1,0,0,1,0,1,0
4,70,354,0,0,0,3,4,3,4,2,...,0.0,0,1,1,0,0,1,0,1,0
5,30,1894,0,0,0,3,2,0,2,2,...,0.0,0,1,1,0,0,1,0,1,0
6,66,227,0,0,0,3,2,5,5,5,...,15.0,0,1,1,0,0,1,0,1,0
7,10,1812,0,0,0,3,2,0,2,2,...,0.0,0,1,1,0,0,1,0,1,0
8,56,73,0,0,0,3,5,3,5,4,...,0.0,0,1,1,0,0,1,1,0,0
9,22,1556,0,0,0,3,2,0,2,2,...,26.0,0,1,1,0,0,1,0,1,0


Then, check the variables of `air_data_subset_encoded`.

In [12]:
# Display the column names of the one-hot encoded dataset
print(air_data_subset_encoded.columns)

Index(['Age', 'Flight Distance', 'Seat comfort',
       'Departure/Arrival time convenient', 'Food and drink', 'Gate location',
       'Inflight wifi service', 'Inflight entertainment', 'Online support',
       'Ease of Online booking', 'On-board service', 'Leg room service',
       'Baggage handling', 'Checkin service', 'Cleanliness', 'Online boarding',
       'Departure Delay in Minutes', 'Arrival Delay in Minutes',
       'satisfaction_dissatisfied', 'satisfaction_satisfied',
       'Customer Type_Loyal Customer', 'Customer Type_disloyal Customer',
       'Type of Travel_Business travel', 'Type of Travel_Personal Travel',
       'Class_Business', 'Class_Eco', 'Class_Eco Plus'],
      dtype='object')


**Question:** What changes do you observe after converting the string data to dummy variables?

After converting the categorical string data into dummy variables, you observe the following changes:

- **New columns for categorical variables:** Each categorical feature (such as `satisfaction`, `Customer Type`, `Type of Travel`, and `Class`) has been split into multiple columns, where each possible category is represented by a new binary column. For example:
  - The `satisfaction` column is split into `satisfaction_dissatisfied` and `satisfaction_satisfied`, where each column contains 0 or 1 values indicating whether the row corresponds to that particular category.
  - The `Customer Type` column is split into `Customer Type_Loyal Customer` and `Customer Type_disloyal Customer`.
  - Similarly, `Type of Travel` is split into `Type of Travel_Business travel` and `Type of Travel_Personal Travel`, and `Class` is split into `Class_Business`, `Class_Eco`, and `Class_Eco Plus`.
- **Removal of string variables:** The original categorical columns are no longer present as string variables. They have been replaced by their one-hot encoded counterparts, which are numeric (0 or 1).
- **Increase in column count:** The total number of columns has increased from 22 to 27, because the 4 categorical columns were replaced by 9 dummy variable columns.
- **Numeric representation:** The new columns contain 0 or 1 values, making them suitable for models such as random forests, which require numerical input data.

These transformations are important for machine learning algorithms since most models cannot directly handle categorical string data, and thus we need to convert it into a numeric format (such as one-hot encoding) for the model to process the data effectively.

## **Step 3: Model building** 

The first step to building your model is separating the labels (y) from the features (X).

In [13]:
# Separate the dataset into labels (y) and features (X)
y = air_data_subset_encoded['satisfaction_satisfied']  # The target variable
X = air_data_subset_encoded.drop(columns=['satisfaction_satisfied', 'satisfaction_dissatisfied'])  # Features

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Save the labels (the values in the `satisfaction` column) as `y`.

Save the features as `X`. 

</details>

<details>
<summary><h4><strong>Hint 2</strong></h4></summary>

To obtain the features, drop the `satisfaction` column from the DataFrame.

</details>

Once separated, split the data into train, validate, and test sets. 

In [14]:
from sklearn.model_selection import train_test_split

# First, split X and y into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Then, split the training set into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Use the `train_test_split()` function twice to create train/validate/test sets, passing in `random_state` for reproducible results. 

</details>

<details>
<summary><h4><strong>Hint 2</strong></h4></summary>

Split `X`, `y` to get `X_train`, `X_test`, `y_train`, `y_test`. Set the `test_size` argument to the proportion of data points you want to select for testing. 

Split `X_train`, `y_train` to get `X_tr`, `X_val`, `y_tr`, `y_val`. Set the `test_size` argument to the proportion of data points you want to select for validation. 

</details>
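The two-step split can be checked on a small example. The sketch below uses made-up arrays (`X_toy` and `y_toy` are assumptions for illustration) and shows why a `test_size` of 0.2 followed by 0.25 yields a 60/20/20 split: the second call takes 25% of the remaining 80%.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up rows with 2 features each, plus alternating 0/1 labels
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0, 1] * 50)

# First split: hold out 20% of all rows for testing
X_rest, X_te, y_rest, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Second split: hold out 25% of the remaining 80% (20% overall) for validation
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_tr), len(X_va), len(X_te))
```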

### Tune the model

Now, fit and tune a random forest model with a separate validation set. Begin by determining a set of hyperparameters for tuning the model using GridSearchCV.


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Create a dictionary to define the hyperparameters and their respective values
cv_params = {
    'n_estimators': [100, 200, 300],  # Number of trees in the forest
    'max_depth': [None, 10, 20, 30],  # Maximum depth of the tree
    'min_samples_split': [2, 5, 10],  # Minimum number of samples required to split an internal node
    'min_samples_leaf': [1, 2, 4],  # Minimum number of samples required to be at a leaf node
    'max_features': ['auto', 'sqrt', 'log2'],  # Number of features to consider for the best split ('auto' was removed in scikit-learn 1.3; drop it on newer versions)
    'max_samples': [None, 0.8, 0.9]  # Fraction of samples to train each tree on (useful for very large datasets)
}

# Instantiate the Random Forest Classifier
rf = RandomForestClassifier(random_state=42)

# Perform grid search with cross-validation on the training data
grid_search = GridSearchCV(estimator=rf, param_grid=cv_params, cv=5, n_jobs=-1, verbose=2)

# Fit the grid search to the training data
grid_search.fit(X_train, y_train)

# Display the best parameters found by GridSearchCV
print("Best parameters found by GridSearchCV:", grid_search.best_params_)

Fitting 5 folds for each of 972 candidates, totalling 4860 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done  98 tasks      | elapsed: 21.9min
[Parallel(n_jobs=-1)]: Done 301 tasks      | elapsed: 59.0min


<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Create a dictionary `cv_params` that maps each hyperparameter name to a list of values. The GridSearch you conduct will set the hyperparameter to each possible value, as specified, and determine which value is optimal.

</details>

<details>
<summary><h4><strong>Hint 2</strong></h4></summary>

The main hyperparameters here include `'n_estimators', 'max_depth', 'min_samples_leaf', 'min_samples_split', 'max_features', and 'max_samples'`. These will be the keys in the dictionary `cv_params`.

</details>

Next, create a list of split indices.

In [None]:
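from sklearn.model_selection import PredefinedSplit

# The earlier cell already separated X_val from X_train, so the validation fold
# used during the grid search is drawn from X_train itself (25% of rows, by position)
_, val_positions = train_test_split(np.arange(len(X_train)), test_size=0.25, random_state=42)
val_positions = set(val_positions.tolist())

# Create a list of split indices: 0 marks validation rows, -1 marks training rows
split_index = [0 if i in val_positions else -1 for i in range(len(X_train))]

# Wrap the list so GridSearchCV can use it as a custom split
custom_split = PredefinedSplit(split_index)

# Display the number of validation and training rows for reference
print(split_index.count(0), split_index.count(-1))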

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Use list comprehension, iterating over the indices of `X_train`. The list can consist of 0s to indicate data points that should be treated as validation data and -1s to indicate data points that should be treated as training data.

</details>

<details>
<summary><h4><strong>Hint 2</strong></h4></summary>

Use `PredefinedSplit()`, passing in `split_index`, saving the output as `custom_split`. This will serve as a custom split that will identify which data points from the train set should be treated as validation data during GridSearch.

</details>
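The convention described in the hints can be seen on a six-row example. This is a minimal sketch with a hand-written `split_index` (the values are assumptions chosen for illustration): rows marked `-1` always stay in the training portion, and rows marked `0` form the single validation fold.

```python
from sklearn.model_selection import PredefinedSplit

# -1 = always used for training; 0 = member of validation fold 0
split_index = [-1, -1, 0, -1, 0, -1]
custom_split = PredefinedSplit(test_fold=split_index)

# With only one fold label (0), the splitter yields exactly one train/validation pair
folds = list(custom_split.split())
train_idx, val_idx = folds[0]

print("Number of splits:", custom_split.get_n_splits())
print("Training rows:", train_idx)
print("Validation rows:", val_idx)
```

Because there is only one split, a grid search that uses it fits each candidate once instead of once per fold, which is why this approach is cheaper than k-fold cross-validation.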

Now, instantiate your model.

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Instantiate the random forest model with a random_state for reproducibility
rf = RandomForestClassifier(random_state=42)

# Display the model to confirm instantiation
print(rf)

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Use `RandomForestClassifier()`, specifying the `random_state` argument for reproducible results. This will help you instantiate a random forest model, `rf`.

</details>

Next, use GridSearchCV to search over the specified parameters.

In [None]:
from sklearn.model_selection import GridSearchCV

# Use GridSearchCV to search over specified parameters
grid_search = GridSearchCV(
    estimator=rf,                # The RandomForest model
    param_grid=cv_params,        # The hyperparameter grid
    scoring=['accuracy', 'precision', 'recall', 'f1'],  # Metrics tracked for each candidate
    cv=custom_split,             # Custom cross-validation split
    refit='f1',                  # Select and refit the best model according to F1 score
    n_jobs=-1,                   # Use all CPU cores for parallel processing
    verbose=1                    # Display progress
)

# Fit the grid search to the training data
grid_search.fit(X_train, y_train)

# Print the best parameters and the best score
print("Best parameters found: ", grid_search.best_params_)
print("Best validation F1 score: {:.2f}".format(grid_search.best_score_))

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Use `GridSearchCV()`, passing in `rf` and `cv_params` and specifying `cv` as `custom_split`. Additional arguments that you can specify include: `refit='f1', n_jobs = -1, verbose = 1`. 

</details>
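The following is a small, fast sketch of the same pattern on synthetic data, not the airline data. The names `X_toy`, `y_toy`, `toy_split`, `toy_params`, and `toy_search` are assumptions for illustration, and the grid is deliberately tiny. It shows that when several metrics are tracked through `scoring`, the `refit` argument names the metric used to choose the best candidate.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Synthetic binary classification data (200 rows, 8 features)
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# Treat the last 50 rows as the single validation fold
toy_split = PredefinedSplit([-1] * 150 + [0] * 50)

# A deliberately small grid: 2 x 2 = 4 candidates
toy_params = {"n_estimators": [10, 20], "max_depth": [2, 4]}

toy_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=toy_params,
    scoring=["accuracy", "precision", "recall", "f1"],  # metrics recorded per candidate
    cv=toy_split,
    refit="f1",  # the best candidate is the one with the highest validation F1
)
toy_search.fit(X_toy, y_toy)

print(toy_search.best_params_)
print(round(toy_search.best_score_, 3))
```

After fitting, `toy_search.cv_results_` holds one entry per metric (for example `mean_test_f1`), so the candidates can be compared on any of the tracked scores.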

Now, fit your model.

In [None]:
%%time

# Fit the GridSearchCV model
grid_search.fit(X_train, y_train)

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Use the `fit()` method to train the GridSearchCV model on `X_train` and `y_train`. 

</details>

<details>
<summary><h4><strong>Hint 2</strong></h4></summary>

Add the magic function `%%time` to keep track of the amount of time it takes to fit the model and display this information once execution has completed. Remember that this code must be the first line in the cell.

</details>

Finally, obtain the optimal parameters.

In [None]:
# Obtain optimal parameters
optimal_params = grid_search.best_params_

# Display the optimal parameters
print("Optimal hyperparameters: ", optimal_params)

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Use the `best_params_` attribute to obtain the optimal values for the hyperparameters from the GridSearchCV model.

</details>
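Because a grid search of this size can take a long time, it can be worth saving the fitted object with `pickle`, which was imported in Step 1 for this purpose. The sketch below is a minimal illustration on synthetic data; `model`, `payload`, and `restored` are assumed names, and the round trip happens in memory so that no file path has to be assumed.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Fit a small forest on synthetic data to stand in for the tuned model
X_toy, y_toy = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_toy, y_toy)

# Serialize to bytes and restore; pickle.dump and pickle.load do the same with a file object
payload = pickle.dumps(model)
restored = pickle.loads(payload)

# The restored model should reproduce the original predictions
same = bool((restored.predict(X_toy) == model.predict(X_toy)).all())
print(same)
```

To write to disk instead, open a file in binary mode (`"wb"` to save, `"rb"` to load) and pass it to `pickle.dump` or `pickle.load`. Only unpickle files from sources you trust.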

## **Step 4: Results and evaluation** 

Use the selected model to predict on your test data. Use the optimal parameters found via GridSearchCV.

In [None]:
# Instantiate the RandomForestClassifier with the optimal parameters
rf_opt = RandomForestClassifier(random_state=42, **optimal_params)

# Fit the model with the training data
rf_opt.fit(X_train, y_train)

# Predict on the test data
y_pred = rf_opt.predict(X_test)

# Display the predictions (optional)
print("Predictions on test data: ", y_pred)

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Use `RandomForestClassifier()`, specifying the `random_state` argument for reproducible results and passing in the optimal hyperparameters found in the previous step. To distinguish this from the previous random forest model, consider naming this variable `rf_opt`.

</details>

Once again, fit the optimal model.

In [None]:
# Fit the optimal model to the training data
rf_opt.fit(X_train, y_train)

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Use the `fit()` method to train `rf_opt` on `X_train` and `y_train`.

</details>

And predict on the test set using the optimal model.

In [None]:
# Predict on the test set
y_pred = rf_opt.predict(X_test)

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

You can call the `predict()` function to make predictions on `X_test` using `rf_opt`. Save the predictions now (for example, as `y_pred`), to use them later for comparing to the true labels. 

</details>

### Obtain performance scores

First, get your precision score.

In [None]:
from sklearn.metrics import precision_score

# Get precision score
precision = precision_score(y_test, y_pred, pos_label=1)

# Print the precision score
print(f"Precision: {precision}")

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

You can call the `precision_score()` function from `sklearn.metrics`, passing in `y_test` and `y_pred` and specifying the `pos_label` argument as `1`, because `y` holds the 0/1 values of the `satisfaction_satisfied` column.
</details>

Then, collect the recall score.

In [None]:
from sklearn.metrics import recall_score

# Get recall score
recall = recall_score(y_test, y_pred, pos_label=1)

# Print the recall score
print(f"Recall: {recall}")

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

You can call the `recall_score()` function from `sklearn.metrics`, passing in `y_test` and `y_pred` and specifying the `pos_label` argument as `1`, because `y` holds the 0/1 values of the `satisfaction_satisfied` column.
</details>

Next, obtain your accuracy score.

In [None]:
from sklearn.metrics import accuracy_score

# Get accuracy score
accuracy = accuracy_score(y_test, y_pred)

# Print the accuracy score
print(f"Accuracy: {accuracy}")

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

You can call the `accuracy_score()` function from `sklearn.metrics`, passing in `y_test` and `y_pred`. Accuracy does not depend on which class is treated as positive, so no `pos_label` argument is needed.
</details>

Finally, collect your F1-score.

In [None]:
from sklearn.metrics import f1_score

# Get F1 score
f1 = f1_score(y_test, y_pred, pos_label=1)

# Print the F1 score
print(f"F1 Score: {f1}")

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

You can call the `f1_score()` function from `sklearn.metrics`, passing in `y_test` and `y_pred` and specifying the `pos_label` argument as `1`, because `y` holds the 0/1 values of the `satisfaction_satisfied` column.
</details>
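To see how these scores relate to one another, the sketch below uses eight made-up labels and predictions (`y_true` and `y_hat` are assumptions for illustration, not model output). It computes F1 by hand as the harmonic mean of precision and recall and checks the result against `f1_score`.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels and predictions: 4 true positives, 1 false positive, 1 false negative
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_hat = [1, 0, 1, 0, 1, 1, 0, 1]

p = precision_score(y_true, y_hat)  # TP / (TP + FP) = 4 / 5
r = recall_score(y_true, y_hat)     # TP / (TP + FN) = 4 / 5

# F1 is the harmonic mean of precision and recall
f1_manual = 2 * p * r / (p + r)

print(p, r, f1_manual, f1_score(y_true, y_hat))
```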

**Question:** How is the F1-score calculated?

[Write your response here. Double-click (or enter) to edit.]

**Question:** What are the pros and cons of performing the model selection using test data instead of a separate validation dataset?

[Write your response here. Double-click (or enter) to edit.]



### Evaluate the model

Now that you have results, evaluate the model. 

**Question:** What are the four basic parameters for evaluating the performance of a classification model?

[Write your response here. Double-click (or enter) to edit.]

**Question:**  What do the four scores demonstrate about your model, and how do you calculate them?

[Write your response here. Double-click (or enter) to edit.]

Calculate the scores: precision score, recall score, accuracy score, F1 score.

In [None]:
# Precision score on test data set
precision = precision_score(y_test, y_pred, pos_label=1)
print(f"Precision Score: {precision}")

In [None]:
# Recall score on test data set
recall = recall_score(y_test, y_pred, pos_label=1)
print(f"Recall Score: {recall}")

In [None]:
# Accuracy score on test data set
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy Score: {accuracy}")

In [None]:
# F1 score on test data set
f1 = f1_score(y_test, y_pred, pos_label=1)
print(f"F1 Score: {f1}")

**Question:** How does this model perform based on the four scores?

[Write your response here. Double-click (or enter) to edit.]

### Evaluate the model

Finally, create a table of results that you can use to evaluate the performance of your model.

In [None]:
import pandas as pd

# Create a dictionary to hold the evaluation metrics for the model
metrics = {
    "Precision": [precision],
    "Recall": [recall],
    "Accuracy": [accuracy],
    "F1 Score": [f1]
}

# Create a DataFrame from the dictionary
results_table = pd.DataFrame(metrics)

# Display the results table
print(results_table)


<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Build a table to compare the performance of the models. Create a DataFrame using the `pd.DataFrame()` function.

</details>
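If you want the table to hold more than one model, one row per model works well. The sketch below is illustrative only: the two sets of predictions are made up and the model names are hypothetical, so the resulting numbers say nothing about the airline models.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up true labels and predictions from two hypothetical models
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
predictions = {
    "Model A (hypothetical)": [1, 0, 1, 0, 1, 1, 0, 1],
    "Model B (hypothetical)": [1, 1, 1, 0, 0, 1, 0, 0],
}

# Build one row of scores per model
rows = []
for name, y_hat in predictions.items():
    rows.append({
        "Model": name,
        "Precision": precision_score(y_true, y_hat),
        "Recall": recall_score(y_true, y_hat),
        "Accuracy": accuracy_score(y_true, y_hat),
        "F1 Score": f1_score(y_true, y_hat),
    })

comparison = pd.DataFrame(rows).set_index("Model")
print(comparison)
```

In this lab, the rows would instead hold the scores of your tuned decision tree and your random forest, computed on the same test set.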

**Question:** How does the random forest model compare to the decision tree model you built in the previous lab?

[Write your response here. Double-click (or enter) to edit.]



## **Considerations**


**What are the key takeaways from this lab? Consider important steps when building a model, most effective approaches and tools, and overall results.**

[Write your response here. Double-click (or enter) to edit.]


**What summary would you provide to stakeholders?**

[Write your response here. Double-click (or enter) to edit.]

### References

[What is the Difference Between Test and Validation Datasets?, Jason Brownlee](https://machinelearningmastery.com/difference-test-validation-datasets/)

[Decision Trees and Random Forests, Neil Liberman](https://towardsdatascience.com/decision-trees-and-random-forests-df0c3123f991)