# Hands-On Activity: Wine Quality Prediction using Random Forest Regression
--- 
### Introduction:

In this hands-on activity, you will work with the Wine Quality dataset to build a Random Forest Regression model to predict the quality of wine based on various features. The quality score ranges from 0 to 10, with higher scores indicating better wine quality.

### Task Overview:

1. **Data Loading and Exploration:**
   - Import the necessary libraries (already provided in the code).
   - Load the Wine Quality dataset using `load_wine()` and create a Pandas DataFrame.
   - Inspect the first few rows of the DataFrame to understand the data.

2. **Dataset Information:**
   - Check the information about the dataset using `df.info()`.

3. **Data Splitting and Standardization:**
   - Separate features (X) and the target variable (y).
   - Split the data into training and testing sets (use `train_test_split` with a test size of 20% and `random_state` of 42).
   - Standardize the features using `StandardScaler`.

4. **Random Forest Regression:**
   - Define a Random Forest Regressor model with `random_state` set to 42.
   - Perform hyperparameter tuning using `GridSearchCV` with the following parameter grid:
     - `n_estimators`: [50, 100, 200]
     - `max_depth`: [None, 10, 20]

5. **Model Evaluation:**
   - Print the best hyperparameters obtained from the grid search.
   - Use the best model to make predictions on the test set.
   - Evaluate the model performance using the R-squared (`r2_score`) on the test set.
   - Display the results.

### Additional Guidelines:
- Follow the provided code structure and fill in the necessary code to complete each task.

Enjoy the hands-on experience, and good luck with your wine quality prediction!

In [11]:
# Import necessary libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV


In [12]:
# Load the Wine Quality dataset
wine_data = load_wine()

# Create a Pandas DataFrame from the dataset
df = pd.DataFrame(wine_data.data, columns=wine_data.feature_names)

# Add the target values to the DataFrame as a new column
df["target"] = wine_data.target

# Inspect the first few rows of the DataFrame
df.head()


Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline,target
0,14.23,1.71,2.43,15.6,127.0,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065.0,0
1,13.2,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050.0,0
2,13.16,2.36,2.67,18.6,101.0,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0,0
3,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0,0
4,13.24,2.59,2.87,21.0,118.0,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735.0,0


In [13]:
# Check the information about the dataset
df.info()

# Separate features (X) and target variable (y)
X = df.drop("target", axis=1)
y = df["target"]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   alcohol                       178 non-null    float64
 1   malic_acid                    178 non-null    float64
 2   ash                           178 non-null    float64
 3   alcalinity_of_ash             178 non-null    float64
 4   magnesium                     178 non-null    float64
 5   total_phenols                 178 non-null    float64
 6   flavanoids                    178 non-null    float64
 7   nonflavanoid_phenols          178 non-null    float64
 8   proanthocyanins               178 non-null    float64
 9   color_intensity               178 non-null    float64
 10  hue                           178 non-null    float64
 11  od280/od315_of_diluted_wines  178 non-null    float64
 12  proline                       178 non-null    float64
 13  targe

In [14]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [15]:
# Standardize the features using StandardScaler
numerical_columns = X_train.select_dtypes(include=['float64', 'int64']).columns

# Create a StandardScaler object
scaler = StandardScaler()

# Apply standard scaling to selected columns
X_train[numerical_columns] = scaler.fit_transform(X_train[numerical_columns])
X_test[numerical_columns] = scaler.fit_transform(X_test[numerical_columns])
# Display the scaled dataset
X_train


Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
158,1.665293,-0.608406,1.218962,1.605400,-0.167384,0.804002,-0.691678,1.267226,1.877540,3.419473,-1.656329,-0.879409,-0.248606
137,-0.549525,2.751541,1.003315,1.605400,-0.304379,-0.785384,-1.401233,2.049600,-0.873505,-0.024801,-0.584633,-1.254621,-0.729922
98,-0.745310,-1.143541,-0.937507,-0.282704,-0.852357,1.937029,1.746791,-1.001659,0.587987,-0.240068,0.358460,0.246227,-0.248606
159,0.612948,-0.617179,1.003315,0.879206,-0.783860,0.489272,-0.901547,1.188988,1.172585,2.881305,-1.656329,-1.129550,-0.381383
38,0.111249,-0.766315,-0.937507,-1.154137,-0.167384,0.174542,0.637487,-0.688710,-0.409266,-0.584496,0.958609,0.135053,0.946386
...,...,...,...,...,...,...,...,...,...,...,...,...,...
71,1.077938,-0.757542,1.111138,1.605400,-0.989352,1.040049,0.857349,-1.236371,0.450435,-0.722267,1.730230,0.788199,-1.078462
106,-0.892149,-0.564542,-0.865625,-0.137465,-1.400336,-1.005695,0.027870,0.015427,0.037778,-0.713656,0.186988,0.802096,-0.746519
14,1.714239,-0.441724,0.068845,-2.170809,0.106605,1.590826,1.636860,-0.610472,2.324585,1.051535,1.044345,0.565852,2.695722
92,-0.353740,-0.739996,-0.362449,0.356346,-1.400336,-1.430580,-0.541773,1.658413,0.020584,-0.864343,0.015517,-0.740442,-0.796311


In [16]:
# # Define the Random Forest Regressor model with random_state
# rf_model = RandomForestRegressor(random_state=42, n_jobs=-1)

# # Hyperparameter tuning with GridSearchCV
# param_grid = {
#     'n_estimators': [50, 100, 200, 300, 500],  # Specify a list of values to try for the number of trees
#     'max_depth': [None, 10, 20]  # Specify the maximum depth of the trees or use None for unlimited depth
# }

# grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, scoring='r2', cv=5, n_jobs=-1)
# grid_search.fit(X_train, y_train)

In [17]:
from skopt import BayesSearchCV
from skopt.space import Real, Integer

# Define the Random Forest Regressor model with random_state
rf_model = RandomForestRegressor(random_state=42, n_jobs=-1)

# Define the search space for hyperparameters
param_space = {
    'n_estimators': Integer(50, 500),  # Number of trees
    'max_depth': Integer(1, 20)  # Maximum depth of the trees
}

# Perform Bayesian optimization
bayes_search = BayesSearchCV(
    estimator=rf_model,
    search_spaces=param_space,
    scoring='r2',
    cv=5,
    n_jobs=-1
)

# Fit the model
bayes_search.fit(X_train, y_train)


In [None]:
# # Best hyperparameters
# best_params = grid_search.best_params_
# best_rf_model = grid_search.best_estimator_

# # Predictions on the test set
# y_pred = best_rf_model.predict(X_test)

# # Model evaluation
# r2 = r2_score(y_test,y_pred)

# # Display the results
# print(f"\nBest Hyperparameters: {best_params}")
# print(f"R-squared (r2) on Test Set: {r2:.4f}")

In [None]:
# Best hyperparameters
best_params = bayes_search.best_params_
best_rf_model = bayes_search.best_estimator_

# Predictions on the test set
y_pred = best_rf_model.predict(X_test)

# Model evaluation
r2 = r2_score(y_test,y_pred)

# Display the results
print(f"\nBest Hyperparameters: {best_params}")
print(f"R-squared (r2) on Test Set: {r2:.4f}")


Best Hyperparameters: OrderedDict({'max_depth': 10, 'n_estimators': 292})
R-squared (r2) on Test Set: 0.9415
