In [16]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [2]:
# Load the dataset
file_path = 'cleaned_dataset.csv'  # Replace with your file path
data = pd.read_csv(file_path)

# Inspect the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48492 entries, 0 to 48491
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Coconut_ID           48492 non-null  object 
 1   Molecular_Formula    48492 non-null  object 
 2   Molecular_Weight     48492 non-null  float64
 3   Number_of_Nitrogens  48492 non-null  int64  
 4   Number_of_Oxygens    48492 non-null  int64  
 5   Number_of_Carbons    48492 non-null  int64  
 6   Total_Atom_Number    48492 non-null  int64  
 7   Bond_Count           48492 non-null  int64  
 8   ALogP                48492 non-null  float64
 9   APolarSurfaceArea    48492 non-null  float64
 10  Topo_PSA             48492 non-null  float64
dtypes: float64(4), int64(5), object(2)
memory usage: 4.1+ MB


In [3]:
data.head()

|   | Coconut_ID | Molecular_Formula | Molecular_Weight | Number_of_Nitrogens | Number_of_Oxygens | Number_of_Carbons | Total_Atom_Number | Bond_Count | ALogP | APolarSurfaceArea | Topo_PSA |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CNP0000002 | C27H36N2O15S | 660.651 | 2 | 15 | 27 | 81 | 47 | -2.0821 | 260.03 | 260.03 |
| 1 | CNP0000003 | C34H30O10 | 598.604 | 0 | 10 | 34 | 74 | 50 | 3.63422 | 137.82 | 137.82 |
| 2 | CNP0000004 | C32H26O9 | 554.551 | 0 | 9 | 32 | 67 | 47 | 3.32262 | 139.59 | 139.59 |
| 3 | CNP0000005 | C33H42O6 | 534.693 | 0 | 6 | 33 | 81 | 42 | 6.8794 | 78.9 | 78.9 |
| 4 | CNP0000006 | C31H24O9 | 540.524 | 0 | 9 | 31 | 64 | 46 | 3.01962 | 150.59 | 150.59 |


In [4]:
X = data.drop(columns=['Coconut_ID', 'Molecular_Formula', 'ALogP'])  # Drop the identifier columns and the target
y = data['ALogP']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [9]:
# Define the preprocessor; StandardScaler is the default scaler for every feature column
preprocessor = ColumnTransformer(
    transformers=[
        ('scaler', StandardScaler(), X.columns)  # The grid search can replace this step via 'preprocessor__scaler'
    ]
)

# Define the pipeline: preprocessing first, then the model
pipeline = Pipeline([
    ('preprocessor', preprocessor),  # Scales the features
    ('model', RandomForestRegressor(random_state=42))  # Fixed seed for reproducible forests
])

## What is a Pipeline in Machine Learning?

In machine learning, a pipeline is a way of organizing the steps in a workflow so that they can be applied in sequence. It ensures that each step, like preparing the data or applying a model, happens automatically and in the correct order. You can think of a pipeline like an assembly line where each station processes the product before it moves to the next station.

## What’s Happening in This Code?

### Preprocessor:

The preprocessor is like the "clean-up crew" that prepares the data before it goes into the machine learning model.

Here, the preprocessor is a ColumnTransformer, which applies a transformation to selected columns. In this case it applies StandardScaler, which rescales each feature to a mean of 0 and a standard deviation of 1. Scaling matters most for models that depend on distances or gradient magnitudes (for example k-nearest neighbours, support vector machines, or regularized linear models). Tree-based models such as RandomForestRegressor split on feature thresholds, so they are largely insensitive to feature scaling; the scaler is kept here so that the choice of scaler can be included in the hyperparameter search.

StandardScaler() is applied to every column of the feature matrix (X.columns), so the set of scaled columns follows whatever features the dataset contains.
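
To make the "mean of 0 and standard deviation of 1" statement concrete, here is a minimal NumPy sketch of the same arithmetic. The array `X_demo` and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Hypothetical feature matrix: two columns on very different scales
# (illustrative values only, not rows from cleaned_dataset.csv).
X_demo = np.array([
    [660.0, 2.0],
    [598.0, 0.0],
    [554.0, 0.0],
    [534.0, 1.0],
])

# Standardize each column: subtract its mean, divide by its standard deviation.
# np.std defaults to the population standard deviation (ddof=0),
# which is also what StandardScaler uses.
col_mean = X_demo.mean(axis=0)
col_std = X_demo.std(axis=0)
X_scaled = (X_demo - col_mean) / col_std

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```

After this step both columns are on a comparable scale, regardless of their original units.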

### Pipeline:

A pipeline is created with two main steps:

1. Preprocessing: First, the data will go through the preprocessor (where the scaling happens).
2. Model: After preprocessing, the processed data will go into a RandomForestRegressor, which is a type of machine learning model used for predicting continuous values (like predicting prices or quantities). The random_state=42 ensures that the model gives consistent results when run multiple times (it helps with reproducibility).

## Why Do We Use This?
Preprocessing ensures that the data is in the form the model expects. Raw data often needs cleaning and scaling before a model can use it well.
The pipeline ties the whole process together, so that data is preprocessed and then fed into the model in a single call to fit or predict. This keeps the code easier to maintain and avoids a common error: during cross-validation the scaler is refit on each training fold only, so no information from the validation fold leaks into the preprocessing.
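
The following sketch shows a pipeline being fitted and used in one go. It runs on synthetic data; the names `X_demo`, `y_demo`, and `demo_pipeline` and the 150/50 split are assumptions for illustration, not part of the notebook's workflow.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestRegressor(n_estimators=20, random_state=0)),
])

# fit() fits the scaler on the training rows, transforms them,
# and then fits the forest on the transformed rows.
demo_pipeline.fit(X_demo[:150], y_demo[:150])

# predict() reuses the already fitted scaler before calling the model.
preds = demo_pipeline.predict(X_demo[150:])
print(preds.shape)

# The scaler only ever saw the 150 training rows.
print(demo_pipeline.named_steps["scaler"].n_samples_seen_)
```

Because the scaler's statistics come from the training rows alone, the held-out rows cannot influence the preprocessing.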

In [11]:
# Define parameter grid
param_grid = {
    # Scaler selection
    'preprocessor__scaler': [StandardScaler(), MinMaxScaler()],
    
    # RandomForest parameters
    'model__n_estimators': [50, 100, 150],
    'model__max_depth': [None, 10, 20],
    'model__min_samples_split': [2, 5],
    'model__min_samples_leaf': [1, 2]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=pipeline,
                           param_grid=param_grid,
                           scoring='neg_mean_squared_error',
                           cv=3,
                           n_jobs=-1,
                           verbose=2)

# Perform grid search
grid_search.fit(X_train, y_train)

# Best parameters and model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

print(f"Best Parameters: {best_params}")

Fitting 3 folds for each of 72 candidates, totalling 216 fits


9 fits failed out of a total of 216.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\admin\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\admin\anaconda3\Lib\site-packages\sklearn\base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\admin\anaconda3\Lib\site-packages\sklearn\pipeline.py", line 473, in fit
    self._final_estimator.fit(Xt, y, **last_step_params["fit"])
  File "C:\Users\admin\anaconda3\Lib\site-packages\sklearn\base.py", lin

Best Parameters: {'model__max_depth': 20, 'model__min_samples_leaf': 1, 'model__min_samples_split': 2, 'model__n_estimators': 150, 'preprocessor__scaler': StandardScaler()}


## What is Grid Search?

1. Grid Search is a method for trying different combinations of parameters (settings) for a machine learning model. The goal is to find the best combination that results in the most accurate model.
2. Think of it like trying different recipes for a cake, where each recipe represents a different combination of ingredients (parameters) to see which one makes the best cake (model).

## Defining the Parameter Grid (param_grid)
The parameter grid defines all the combinations of parameters that we want to test. Here’s how it’s structured:

Scaler Selection: The first setting we want to test is the type of scaling we apply to the data. We can either use:

StandardScaler: Scales the data so that it has a mean of 0 and a standard deviation of 1.

MinMaxScaler: Scales the data to a range between 0 and 1.

Random Forest Parameters: These are the settings for the RandomForestRegressor model that we are using to make predictions:

n_estimators: The number of trees in the random forest. We are testing with 50, 100, and 150 trees.

max_depth: The maximum depth (or levels) that each tree can have. It controls how deep each tree can grow. We test with None (no limit), 10, and 20.

min_samples_split: The minimum number of samples required to split a node into two. We are testing values of 2 and 5.

min_samples_leaf: The minimum number of samples required to be at a leaf node. We test values of 1 and 2.
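
As a rough check on the size of this grid, the combinations can be enumerated with plain Python. In this sketch the scaler objects are replaced by their names as strings, and `demo_grid` is a stand-in for the notebook's `param_grid`.

```python
from itertools import product

# Same values as param_grid; strings stand in for the scaler objects.
demo_grid = {
    "preprocessor__scaler": ["StandardScaler", "MinMaxScaler"],
    "model__n_estimators": [50, 100, 150],
    "model__max_depth": [None, 10, 20],
    "model__min_samples_split": [2, 5],
    "model__min_samples_leaf": [1, 2],
}

# Every combination of one value per parameter is one candidate.
candidates = [dict(zip(demo_grid, values)) for values in product(*demo_grid.values())]

n_folds = 3
print(len(candidates))            # 72
print(len(candidates) * n_folds)  # 216
```

The two numbers match the "72 candidates, totalling 216 fits" line in the grid search output.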

## Setting Up GridSearchCV
GridSearchCV is used to perform the grid search. It takes in the pipeline (which includes preprocessing and the model), the parameter grid we defined above, and some additional settings:

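scoring: We use neg_mean_squared_error, the negated mean squared error. scikit-learn treats every score as higher-is-better, so the MSE is negated: a score closer to 0 means a smaller prediction error.

cv=3: The training data is split into 3 folds. Each candidate is trained on 2 folds and validated on the remaining one, rotating so that every fold is used for validation once; the three scores are then averaged.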

n_jobs=-1: This tells GridSearchCV to use all available processors to speed up the process.

verbose=2: This option provides more detailed information about what’s happening during the grid search.
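
The sign convention of the scoring setting is easy to get backwards, so here is a small sketch with made-up error values. `candidate_mse` is hypothetical and does not come from this grid search.

```python
import numpy as np

# Hypothetical mean squared errors for three candidate settings (made-up numbers).
candidate_mse = np.array([0.90, 0.71, 1.20])

# scoring='neg_mean_squared_error' reports the negated MSE ...
candidate_score = -candidate_mse

# ... so the best candidate is the one with the highest score,
# which is the same candidate that has the lowest MSE.
best_index = int(np.argmax(candidate_score))
print(best_index)                   # 1
print(candidate_score[best_index])  # -0.71
```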

## Running the Grid Search

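grid_search.fit(X_train, y_train): This command starts the grid search. It evaluates all 72 parameter combinations with 3-fold cross-validation on the training data (X_train and y_train), for 216 fits in total. In the run above, 9 of those fits failed; by default scikit-learn records a NaN score for a failed fit and carries on, and the warning suggests setting error_score='raise' to debug the cause. The traceback in the output is truncated, so the cause is not visible here.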
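
For intuition about what happens to each candidate during the search, this sketch splits nine sample indices into three folds by hand. It is a simplified illustration of k-fold splitting, not scikit-learn's internal code.

```python
import numpy as np

n_samples, n_folds = 9, 3
indices = np.arange(n_samples)

# Cut the indices into 3 consecutive, equally sized folds.
folds = np.array_split(indices, n_folds)

splits = []
for i, val_idx in enumerate(folds):
    # Train on every fold except the i-th, validate on the i-th.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    splits.append((train_idx, val_idx))
    print(f"fold {i}: train={train_idx.tolist()} validate={val_idx.tolist()}")
```

Each sample is used for validation exactly once, and a candidate's score is the average over the three validation folds.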

## Getting the Best Model and Parameters

best_params = grid_search.best_params_: This gives us the combination of settings (parameters) with the best mean cross-validation score.

best_model = grid_search.best_estimator_: This gives us the pipeline refit on the whole training set with those settings, which GridSearchCV does by default (refit=True).
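
Beyond the single best result, the per-candidate scores are available in `cv_results_`. The sketch below runs a deliberately tiny search on synthetic data to show where these attributes live; the data, the two-value grid, and the name `demo_search` are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=120)

demo_search = GridSearchCV(
    estimator=RandomForestRegressor(n_estimators=10, random_state=0),
    param_grid={"max_depth": [2, 5]},
    scoring="neg_mean_squared_error",
    cv=3,
)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
print(demo_search.best_score_)  # negated MSE, so never above 0

# One mean cross-validation score per candidate.
for params, score in zip(demo_search.cv_results_["params"],
                         demo_search.cv_results_["mean_test_score"]):
    print(params, score)
```

Looking at all candidates, rather than only the winner, shows how much the score actually changes across the grid.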

## The best parameters found by the grid search are:

### model__max_depth: 20:

Among the tested values (None, 10, and 20), a maximum depth of 20 gave the best cross-validation score. Trees limited to 20 levels can still capture complex patterns, and the limit is a mild guard against overfitting compared with unlimited depth. Only three values were tested, so this is the best depth in the grid rather than a proven optimum.

### model__min_samples_leaf: 1:

A minimum of 1 sample per leaf scored best. This is the least restrictive setting (and the scikit-learn default): each leaf may contain a single sample. It lets the trees capture finer details in the data, though it can also increase the risk of overfitting, especially if the dataset is noisy.

### model__min_samples_split: 2:

A minimum of 2 samples to split a node scored best, which is again the least restrictive setting: any node with at least 2 samples may be split. This allows more splits and more granular patterns, but it can also lead to overfitting if the trees become too specific to the training data.

### model__n_estimators: 150:

A forest of 150 trees, the largest value in the grid, scored best. More trees generally stabilize a forest's predictions with diminishing returns, at a higher computational cost. Because the best value sits at the edge of the grid, a larger forest might score slightly better still; this search does not show where the gains level off.

### preprocessor__scaler: StandardScaler():

StandardScaler was selected over MinMaxScaler. It standardizes the features to a mean of 0 and a standard deviation of 1, which helps many machine learning models when features have different units or ranges. Random forests, however, split on thresholds and are largely insensitive to this kind of rescaling, so the difference between the two scalers is probably small here and should not be over-interpreted.

## Overall Inference:
The selected Random Forest uses 150 trees, a maximum depth of 20, and the least restrictive split and leaf settings in the grid (min_samples_split=2, min_samples_leaf=1). In other words, the search favoured flexible trees, with the depth limit as the only constraint on their growth.

The choice of StandardScaler is unlikely to matter much for a tree-based model.

These settings had the best cross-validation score on the training data. Whether the model generalizes well to unseen data can only be judged on the held-out test set, which is evaluated next.
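
One common way to check for overfitting is to compare the score on the training data with the score on held-out data. The sketch below does this on synthetic data; the data-generating formula and all variable names are assumptions for illustration and say nothing about the ALogP model itself.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic, noisy regression data (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# score() returns R^2 for regressors.
train_r2 = forest.score(X_tr, y_tr)
test_r2 = forest.score(X_te, y_te)

print(f"train R^2: {train_r2:.3f}")
print(f"test  R^2: {test_r2:.3f}")
print(f"gap:       {train_r2 - test_r2:.3f}")
```

A large gap between the two scores would point to overfitting; applying the same comparison to `best_model` with `X_train` and `X_test` would show how it behaves on the real data.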

In [12]:
y_pred = best_model.predict(X_test)

# Calculate metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

# Print metrics
print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"R-squared (R²): {r2}")

Mean Absolute Error (MAE): 0.5318305749217103
Mean Squared Error (MSE): 0.7095168646858466
Root Mean Squared Error (RMSE): 0.842328240465584
R-squared (R²): 0.9346997669185286


## Inference from the Evaluation Metrics:

### R-squared (R²) = 0.9347:

The model explains 93.47% of the variance in ALogP on the held-out test set, which is a strong result. It indicates that the predictions track the true ALogP values closely for data the model did not see during training.

### Mean Absolute Error (MAE) = 0.5318:

The average absolute error per prediction is about 0.53 ALogP units. Given that ALogP has a range of 70+, this error is small relative to the spread of the target, although a range is sensitive to outliers and the standard deviation of the target is a more informative yardstick.

### Mean Squared Error (MSE) = 0.7095:

The MSE is expressed in squared ALogP units, so it cannot be compared directly with the MAE. Squaring penalizes large deviations more heavily than small ones. The value of 0.7095 is consistent with the R² score, since R² = 1 − MSE / Var(y) on the test set.

### Root Mean Squared Error (RMSE) = 0.8423:

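The RMSE of 0.84 is the square root of the MSE and is back in ALogP units. The reported R² and MSE imply a test-set standard deviation of ALogP of about 3.3, so the RMSE is roughly a quarter of the target's spread. The RMSE is also about 1.6 times the MAE, which indicates that a minority of predictions have errors well above the typical 0.53.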

## Overall Inference:
The model performs strongly on the held-out test set, with an R² of 0.9347 and modest prediction errors (MAE = 0.53, RMSE = 0.84). Relative to the spread of ALogP, these errors are acceptable and suggest the model is reliable for estimating ALogP for compounds similar to those in the dataset. The gap between RMSE and MAE suggests that a minority of compounds are predicted less accurately and would be worth inspecting.
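
To show how the four metrics relate to each other, here is a small worked sketch that computes them by hand with NumPy. The arrays `y_true_demo` and `y_hat_demo` are made-up values, not predictions from the model above.

```python
import numpy as np

# Made-up true values and predictions (illustrative only).
y_true_demo = np.array([3.0, -1.0, 2.0, 6.0])
y_hat_demo = np.array([2.5, 0.0, 2.0, 7.0])

errors = y_true_demo - y_hat_demo

demo_mae = np.mean(np.abs(errors))   # average absolute error
demo_mse = np.mean(errors ** 2)      # average squared error
demo_rmse = np.sqrt(demo_mse)        # back in the target's units

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
demo_r2 = 1.0 - ss_res / ss_tot

print(demo_mae)   # 0.625
print(demo_mse)   # 0.5625
print(demo_rmse)  # 0.75
print(demo_r2)    # approximately 0.91
```

The same formulas are what mean_absolute_error, mean_squared_error, and r2_score compute on `y_test` and `y_pred`.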