# Random forest and gradient boosting trees

Put simply, a **random forest** is an ensemble of decision trees in which each tree is trained with its own source of randomness: a random sample of the training data and a random subset of the features. The logic behind this model is that many weakly correlated individual decision trees are expected to perform better as a group than any of them does alone.

The main idea behind **gradient boosting** is to build models sequentially, where each subsequent model tries to reduce the error of the previous ones as measured by a *loss function*. The goal is therefore to minimize the loss function by adding weak learners, following a procedure analogous to gradient descent.
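
To make the sequential idea concrete, here is a minimal sketch, assuming a squared-error loss, for which the negative gradient is simply the residual. The helper names `fit_boosted_trees` and `predict_boosted_trees` and the toy data are illustrative assumptions; they are not part of scikit-learn or of the dataset used in this notebook.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_trees(X, y, n_stages=50, learning_rate=0.1, max_depth=2):
    """Fit shallow trees one after another, each on the current residuals."""
    baseline = float(np.mean(y))               # initial constant model
    prediction = np.full(len(y), baseline)
    trees, losses = [], []
    for _ in range(n_stages):
        residuals = y - prediction             # negative gradient of 0.5*(y - F)^2
        weak_learner = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        weak_learner.fit(X, residuals)
        # Take a small step in the direction suggested by the new tree
        prediction = prediction + learning_rate * weak_learner.predict(X)
        trees.append(weak_learner)
        losses.append(float(np.mean((y - prediction) ** 2)))
    return baseline, trees, losses

def predict_boosted_trees(X, baseline, trees, learning_rate=0.1):
    """Sum the scaled contributions of all trees on top of the baseline."""
    return baseline + learning_rate * sum(t.predict(X) for t in trees)

# Hypothetical toy data: a noisy sine curve
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 2.0, size=(200, 1))
y_demo = np.sin(3.0 * X_demo[:, 0]) + rng.normal(0.0, 0.1, size=200)

baseline, trees, losses = fit_boosted_trees(X_demo, y_demo)

print(f'Training MSE after the first stage = {losses[0]:.4f}')
print(f'Training MSE after the last stage  = {losses[-1]:.4f}')
```

Each stage can only lower (or keep) the training error in this setting, which is why the number of stages and the learning rate have to be chosen with a validation or testing set rather than with the training error.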

## Overview

In this notebook we will continue elaborating on decision trees. Here we will illustrate the use of random forests and gradient boosting for classification and regression models.

# Libraries

In [None]:
import numpy      as np
import pandas     as pd

# pip installation of mendeleev is not up to date, so we need to install it from the git repository
# ! pip install git+https://github.com/lmmentel/mendeleev.git
import mendeleev  as mendel

import matplotlib.pyplot as plt

from sklearn                 import tree
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble        import RandomForestRegressor
from sklearn.ensemble        import RandomForestClassifier
from sklearn.ensemble        import GradientBoostingRegressor
from sklearn.ensemble        import GradientBoostingClassifier
from sklearn.metrics         import accuracy_score
from sklearn.metrics         import mean_squared_error
from sklearn.model_selection import train_test_split

from pymatgen.core.periodic_table import Element

plt.rc('xtick', labelsize=18)
plt.rc('ytick', labelsize=18)

blue   = '#0021A5'
orange = '#FA4616'

## 1. Data for classification

We will select 47 elements that occur in the fcc, hcp, and bcc structures. The elements listed were chosen because querying them for these properties yields a dataset with almost no unknown values, and because they represent the three most common crystallographic structures. We then query both Pymatgen and Mendeleev to get a complete set of properties per element. We will use this data to create the features used to train and test the model.

In [None]:
# Define the attributes that we will query from the Mendeleev and Pymatgen databases

fcc = ['Ag', 'Al', 'Au', 'Cu', 'Ir',
       'Ni', 'Pb', 'Pd', 'Pt', 'Rh',
       'Th', 'Yb']

bcc = ['Ba', 'Ca', 'Cr', 'Cs', 'Eu',
       'Fe', 'Li', 'Mn', 'Mo', 'Na',
       'Nb', 'Rb', 'Ta', 'V',  'W']

hcp = ['Be', 'Cd', 'Co', 'Dy', 'Er',
       'Gd', 'Hf', 'Ho', 'Lu', 'Mg',
       'Re', 'Ru', 'Sc', 'Tb', 'Ti',
       'Tl', 'Tm', 'Y',  'Zn', 'Zr']

query_mendeleev = ['atomic_number', 'atomic_volume',
                   'boiling_point', 'en_ghosh', 
                   'evaporation_heat', 'heat_of_formation',
                   'melting_point', 'specific_heat']

query_pymatgen  = ['atomic_mass', 'atomic_radius',
                   'electrical_resistivity', 'molar_volume',
                   'bulk_modulus', 'youngs_modulus',
                   'average_ionic_radius', 'density_of_solid',
                   'coefficient_of_linear_thermal_expansion']

elements = fcc + bcc + hcp

queries  = query_mendeleev + query_pymatgen

# Randomly shuffle the elements
np.random.seed(42)
np.random.shuffle(elements)

all_attributes, all_labels = [], []

# Iterate over elements
for item in elements:
    attributes = []
    
    element = mendel.element(item)

    # Query Mendeleev
    for i in query_mendeleev:    
        attributes.append( getattr(element,i) )

    element = Element(item)

    # Query Pymatgen
    for i in query_pymatgen:
        attributes.append( getattr(element,i) )
    
    # Append queries to the list
    all_attributes.append(attributes)
    
    if (item in fcc):
        all_labels.append(0)

    elif (item in bcc):
        all_labels.append(1)

    elif (item in hcp):
        all_labels.append(2)

# Create a dataframe with the values
dataframe = pd.DataFrame(all_attributes, columns=queries)

A few values are not available for a small number of elements, so we will manually fill in that information in our dataframe.

In [None]:
# Missing value for Cesium
# Ref: David R. Lide (ed), CRC Handbook of Chemistry and Physics, 84th Edition. CRC Press. Boca Raton, Florida, 2003

idx = dataframe.index[dataframe['atomic_number'] == 55]
jdx = dataframe.columns.get_loc("coefficient_of_linear_thermal_expansion")

dataframe.iloc[idx, jdx] = 0.000097 

# Missing value for Rubidium
# Ref: https://www.azom.com/article.aspx?ArticleID=1834

idx = dataframe.index[dataframe['atomic_number'] == 37]
jdx = dataframe.columns.get_loc("coefficient_of_linear_thermal_expansion")

dataframe.iloc[idx, jdx] = 0.000090 

# Missing value for Ruthenium
# Ref: https://www.webelements.com/ruthenium/thermochemistry.html

idx = dataframe.index[dataframe['atomic_number'] == 44]
jdx = dataframe.columns.get_loc("evaporation_heat")

dataframe.iloc[idx, jdx] = 595 # kJ/mol 


# Missing value for Zirconium
# Ref: https://materialsproject.org/materials/mp-131

idx = dataframe.index[dataframe['atomic_number'] == 40]
jdx = dataframe.columns.get_loc("bulk_modulus")

dataframe.iloc[idx, jdx] = 94 # GPa 

### 1.1 Preprocessing the data

- We normalize the data and randomly split it into training and testing sets.

- We have 47 elements for which the crystal structure is known and we will use 40 of these as a training set and the remaining 7 as testing set.

- We will again use the Standard Score Normalization, which subtracts the mean of each feature and divides by its standard deviation.
$$
\overline{X} = \frac{X - \mu}{\sigma}
$$
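Decision-tree ensembles are largely insensitive to the scale of individual features, because each split only compares one feature against a threshold. We still normalize so that the inputs do not depend on the choice of units and so that the same preprocessing can be reused with models that are sensitive to scale.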
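
The standard score can be sketched in a few lines of NumPy. The two features below are hypothetical and only chosen to have very different units; the statistics are estimated from the training rows only, a common choice that keeps information from the testing rows out of the preprocessing.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical features on very different scales
features = np.column_stack((rng.normal(300.0, 50.0, 100),
                            rng.normal(1.0e-5, 2.0e-6, 100)))

train, test = features[:80], features[80:]

# Per-feature mean and standard deviation of the training rows
mu    = train.mean(axis=0)
sigma = train.std(axis=0)

train_scaled = (train - mu) / sigma
test_scaled  = (test  - mu) / sigma   # reuse the training statistics

print('Scaled training mean:', np.round(train_scaled.mean(axis=0), 6))
print('Scaled training std :', np.round(train_scaled.std(axis=0), 6))
```

After scaling, both training features have zero mean and unit standard deviation regardless of their original units; the testing rows will be close to, but not exactly at, those values.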

In [None]:
all_attributes = [ list(dataframe.iloc[x]) for x in range( len(all_attributes) ) ]

all_attributes = np.array(all_attributes, dtype = float)
all_labels     = np.array(all_labels,     dtype = int)

# Split data into 87% training and 13% testing

X_train, X_test, y_train, y_test = train_test_split(all_attributes, all_labels, test_size=0.13, random_state=42)

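# Normalize the data using the statistics of the training set only,
# so that no information from the testing set leaks into the preprocessing

mean = np.mean(X_train, axis = 0)
std  = np.std(X_train,  axis = 0)

X_train = (X_train - mean) / std
X_test  = (X_test  - mean) / std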

## 2. Random Forest Classification

The fundamental idea behind a random forest is to combine many decision trees into a single model. Each decision tree in the forest considers a random subset of features and only has access to a random set of the training data points. This increases diversity in the forest leading to more robust overall predictions. When doing a classification, where the targets are a discrete class label, the random forest algorithm takes the majority vote for the predicted class.
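
The mechanism described above can be sketched by hand with individual decision trees. This is a simplified illustration on hypothetical toy data (`make_classification`), not the internals of `RandomForestClassifier`: each tree sees a bootstrap sample of the rows and considers a random subset of features at every split, and the ensemble prediction is the majority vote.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Hypothetical three-class toy problem
X_toy, y_toy = make_classification(n_samples=300, n_features=8, n_informative=4,
                                   n_classes=3, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25
toy_forest = []

for seed in range(n_trees):
    # Bootstrap: draw as many rows as the dataset has, with replacement
    rows = rng.integers(0, len(X_toy), size=len(X_toy))
    # max_features='sqrt' makes each split look at a random subset of features
    single_tree = DecisionTreeClassifier(max_features='sqrt', random_state=seed)
    single_tree.fit(X_toy[rows], y_toy[rows])
    toy_forest.append(single_tree)

# Collect the votes, shape (n_trees, n_samples), and take the most common class
votes = np.stack([t.predict(X_toy) for t in toy_forest])
majority = np.array([np.bincount(column, minlength=3).argmax() for column in votes.T])

accuracy = float(np.mean(majority == y_toy))
print(f'Accuracy of the majority vote on the training data = {accuracy:.3f}')
```

Note that scikit-learn's `RandomForestClassifier` combines the trees by averaging their predicted class probabilities rather than by counting hard votes; the two usually agree, but not always.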

In [None]:
# Create the object
random_forest_classification = RandomForestClassifier()

Now that we have created the object, we have to optimize the hyperparameters of the model. Let's list the available choices.

In [None]:
# List the hyperparameters that can be tuned
for idx, key in enumerate( random_forest_classification.get_params().keys() ):
    print(f'({idx+1:2d}): {key}')
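
One possible way to search over a few of these hyperparameters is `GridSearchCV`, which is already imported at the top of the notebook. The sketch below is only a starting point under stated assumptions: the toy data, the candidate values in `param_grid`, the number of folds, and the name `tuned_forest` are all illustrative choices, not recommended settings for the element dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical toy data standing in for the real features and labels
X_toy, y_toy = make_classification(n_samples=120, n_features=6, n_informative=4,
                                   n_classes=3, random_state=0)

# Candidate values to try for each hyperparameter (illustrative only)
param_grid = {'max_depth':         [2, 4, None],
              'min_samples_split': [2, 4],
              'min_samples_leaf':  [1, 2]}

# Every combination is scored with 3-fold cross-validation
search = GridSearchCV(RandomForestClassifier(n_estimators=25, random_state=0),
                      param_grid, cv=3, scoring='accuracy')
search.fit(X_toy, y_toy)

print('Best parameters:', search.best_params_)
print(f'Best cross-validated accuracy = {search.best_score_:.3f}')

# The best model, refitted on all of the data passed to fit()
tuned_forest = search.best_estimator_
```

An existing estimator can also be updated in place with `set_params(**search.best_params_)`. With only 40 training elements, the cross-validated scores will be noisy, so small differences between parameter combinations should not be over-interpreted.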

>### Assignment
>
> Optimize `min_samples_split`, `max_depth`, and `min_samples_leaf`. Then set the classification object with those parameters.

Once we have determined the optimal hyperparameters, we can train the model. The following code trains the model and evaluates its accuracy on both the training and the testing data.

In [None]:
# Train the model
random_forest_classification.fit(X_train, y_train)

# Predict the response for training and testing dataset
predicted_train = random_forest_classification.predict(X_train)
predicted_test  = random_forest_classification.predict(X_test)

# Model Accuracy for training and testing set, how often is the classifier correct?
print(f'Training accuracy = '
      f'{accuracy_score(y_train, predicted_train):.3f}')

print(f'Testing accuracy  = '
      f'{accuracy_score(y_test, predicted_test):.3f}')

# Plot the tree
label_names = ('fcc', 'bcc', 'hcp')

fig = plt.figure(figsize=(16,8))

# Select an individual decision tree, here 0.
_ = tree.plot_tree(random_forest_classification.estimators_[0], feature_names=queries,
                   class_names = label_names, filled=True, impurity=True, rounded=True)

For ease of comparison, we can create a dataframe and collect the labels predicted by our model and the actual labels. We can then compare the two and see how well our model is doing.

In [None]:
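# Recover which rows ended up in each split: the same sample count, test_size,
# and random_state reproduce the shuffle used to create X_train and X_test
row_train, row_test = train_test_split(np.arange(len(dataframe)), test_size=0.13, random_state=42)
rows = np.hstack((row_train, row_test))

# Map the integer labels (0, 1, 2) to the structure names
label_names = np.array(['fcc', 'bcc', 'hcp'])

reference = label_names[np.hstack((y_train, y_test))]
predicted = label_names[np.hstack((predicted_train, predicted_test))]

# Reorder the atomic numbers so that they line up with the labels
data_dictionary = {'AtomicNumber': dataframe['atomic_number'].values[rows],
                   'Reference': reference,
                   'Predicted': predicted,
                   'Status': np.where(reference == predicted, 'Correct', 'Incorrect')}

reference_vs_predicted = pd.DataFrame(data_dictionary)

reference_vs_predicted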

## 3. Gradient Boosting Classification

Alternatively, we can use gradient boosting and compare it with the previous method. We will use the same training and testing sets.

In [None]:
# Create the gradient boosting classifier object
gradient_boosting_classification = GradientBoostingClassifier()

Before optimizing the hyperparameters, let's list the ones that can be tuned.

In [None]:
for idx, key in enumerate( gradient_boosting_classification.get_params().keys() ):
    print(f'({idx+1:2d}): {key}')

>### Assignment
>
> Optimize `min_samples_split`, `max_depth`, `min_samples_leaf`, and `learning_rate`. Then set the classification object with those parameters.

Now that we have optimized the hyperparameters, we can train the model.

In [None]:
# Train the gradient boosting classifier
gradient_boosting_classification.fit(X_train, y_train)

# Predict the response for training and testing dataset
predicted_train = gradient_boosting_classification.predict(X_train)
predicted_test  = gradient_boosting_classification.predict(X_test)

# Model Accuracy for training and testing set, how often is the classifier correct?
print(f'Training accuracy = '
      f'{accuracy_score(y_train, predicted_train):.3f}')

print(f'Testing accuracy  = '
      f'{accuracy_score(y_test, predicted_test):.3f}')

## 4. Data for regression

In [None]:
# Create the reference function that generates our data
def reference_function(x):
    return np.cos(x) + 2.0*np.sin(x) + 3.0*np.cos(2.0*x)

np.random.seed(seed=5)

# Generate a data set for machine learning
x = np.linspace(0, 2, 300)
x = x + np.random.normal(0.0, 0.3, x.shape)

y = reference_function(x) + np.random.normal(0.0,1.0, x.shape)

# Split the dataset into 80% for training and 20% for testing
x = x.reshape( (-1,1) )

X_train, X_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=21)

# Plot the training and testing dataset
fig,ax=plt.subplots( figsize=(8,8) )

ax.scatter(X_train, y_train, c=blue, label='Training')
ax.scatter(X_test, y_test,   c=orange, label='Testing')

ax.set_title('Training and testing data',fontsize=20)

ax.set_xlabel('X Values',fontsize=18)
ax.set_ylabel(r'$ \cos(x)+2\sin(x)+3\cos(2x)$',fontsize=18)

plt.legend(loc='best', fontsize=18)

plt.show()

## 5. Random Forest Regression

In contrast to the classification task, where the trees vote on a class, the prediction of a continuous variable is computed as the average of all the individual decision tree estimates.
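
The averaging can be checked directly, since a fitted `RandomForestRegressor` exposes its individual trees through the `estimators_` attribute. The toy data below is a hypothetical stand-in for the regression dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical noisy one-dimensional data
rng = np.random.default_rng(0)
X_toy = rng.uniform(0.0, 2.0, size=(200, 1))
y_toy = np.cos(X_toy[:, 0]) + rng.normal(0.0, 0.2, size=200)

forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X_toy, y_toy)

# Predictions of every individual tree at a few new points, shape (n_trees, n_points)
X_new    = np.linspace(0.0, 2.0, 5).reshape(-1, 1)
per_tree = np.stack([t.predict(X_new) for t in forest.estimators_])

# Averaging over the trees reproduces the prediction of the forest
manual_average = per_tree.mean(axis=0)

print('Average of the trees :', np.round(manual_average, 3))
print('Forest prediction    :', np.round(forest.predict(X_new), 3))
```

The spread of `per_tree` along the first axis also gives a rough, informal sense of how much the trees disagree at each point.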

In [None]:
# Create the object
random_forest_regression = RandomForestRegressor()

Before continuing, we will optimize the hyperparameters of the random forest regression model using a grid search. Let's list our choices.

In [None]:
# List the hyperparameters that can be tuned
for idx, key in enumerate( random_forest_regression.get_params().keys() ):
    print(f'({idx+1:2d}): {key}')

>### Assignment
>
> Optimize `min_impurity_decrease`, `min_samples_split`, `max_depth`, `min_samples_leaf`, and `max_leaf_nodes`. Then set the regression object with those parameters.

Now we train our model and evaluate its performance using the coefficient of determination ($R^2$, returned by the `score` method) and the root-mean-square error (RMSE).
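
As a small worked example of the two metrics, the sketch below computes them by hand for four made-up values and compares the results with scikit-learn. The arrays `y_true` and `y_pred` are purely illustrative.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up reference values and predictions; every error is +0.5 or -0.5
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 1.5, 3.5, 3.5])

# RMSE: square root of the mean squared error
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R^2: one minus the ratio of residual to total sum of squares
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f'RMSE = {rmse:.3f}')   # 0.500
print(f'R^2  = {r2:.3f}')     # 0.800
```

RMSE is expressed in the units of the target, while $R^2$ is dimensionless and equals 1 for a perfect prediction.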

In [None]:
# Train the optimized regression model
random_forest_regression.fit(X_train, y_train)

# Coefficient of determination (R^2) for the prediction
print(f'Training score = '
      f'{random_forest_regression.score(X_train,y_train):.3f}')

print(f'Testing  score = '
      f'{random_forest_regression.score(X_test,y_test):.3f}\n')

predicted_train = random_forest_regression.predict(X_train)
predicted_test = random_forest_regression.predict(X_test)

training_rmse = np.sqrt( mean_squared_error(y_train, predicted_train) )
testing_rmse = np.sqrt( mean_squared_error(y_test, predicted_test) )
    
print(f'Training RMSE = {training_rmse:.3f}')
print(f'Testing  RMSE = {testing_rmse:.3f}')

Let's visualize our model.

In [None]:
# create a series of sampling points to plot the model
points  = 1000

X_model = np.linspace(np.min(x), np.max(x), num=points)
X_model = X_model.reshape( (-1,1) )

y_model_predictions = random_forest_regression.predict(X_model)
y_model_reference   = reference_function(X_model)

# Plot the dataset
fig,ax=plt.subplots( figsize=(16,8) )

ax.scatter(X_train, y_train, c=blue, label='Data')
ax.scatter(X_test, y_test, c=orange, label='Testing')

ax.plot(X_model, y_model_predictions, c=blue, lw=2, label='Model')
ax.plot(X_model, y_model_reference,   c='k', lw=4, label='Reference')

ax.set_title('Performance', fontsize=20)

ax.set_xlabel('x values', fontsize=18)
ax.set_ylabel('y values', fontsize=18)

ax.legend(loc='best', fontsize=18)

plt.show()

In [None]:
fig,ax=plt.subplots( figsize=(8,8) )

ax.scatter(y_test, predicted_test, c=orange, label='Testing')
ax.scatter(y_train, predicted_train, c=blue, label='Training')

ax.set_xlabel('Reference', fontsize=18)
ax.set_ylabel('Prediction', fontsize=18)

ax.legend(loc='best', fontsize=18)

plt.show()

## 6. Gradient Boosting Regression

In [None]:
# Create the object
gradient_boosting_regression = GradientBoostingRegressor()

Once again, we must optimize the hyperparameters. But first, let's list our choices.

In [None]:
# List the hyperparameters that can be tuned
for idx, key in enumerate( gradient_boosting_regression.get_params().keys() ):
    print(f'({idx+1:2d}): {key}')

The `learning_rate` parameter controls the step size at which the model updates the predictions at each boosting stage. It scales the contribution of each new tree, effectively determining how much influence each tree has on the final prediction.

Default value: 0.1

Effect:
  - Lower values (e.g., 0.01): Slower learning that requires more trees to reach the same performance, but often generalizes better.
  - Higher values (e.g., 0.5 or 1.0): Faster learning, but can lead to overfitting if too large.

The `n_estimators` parameter defines the number of boosting stages (trees) to be used in the ensemble. Each tree corrects the residuals of the previous ones to improve prediction accuracy.

Default value: 100

Effect:
  - Higher values (e.g., 500, 1000): Can improve performance, but increase training time and the risk of overfitting (if not regularized).
  - Lower values (e.g., 50, 100): Faster training but may lead to underfitting.

Trade-off between `learning_rate` and `n_estimators`:
  - A smaller `learning_rate` (e.g., 0.01) often requires a larger `n_estimators` (e.g., 500 or more).
  - A larger `learning_rate` (e.g., 0.1 or 0.2) can work well with a smaller `n_estimators` (e.g., 100 to 200).
  - Typically, lower learning rates (e.g., 0.01 to 0.1) combined with more trees (`n_estimators`) lead to better generalization.
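
The trade-off listed above can be explored with `staged_predict`, which yields the prediction of a fitted `GradientBoostingRegressor` after each boosting stage. The following is a sketch on hypothetical toy data; the learning rates, the cap of 300 trees, and the name `best_stage` are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical noisy one-dimensional data
rng = np.random.default_rng(0)
X_toy = rng.uniform(0.0, 2.0, size=(300, 1))
y_toy = (np.cos(X_toy[:, 0]) + 2.0 * np.sin(X_toy[:, 0])
         + rng.normal(0.0, 0.5, size=300))

X_fit, X_val, y_fit, y_val = train_test_split(X_toy, y_toy, train_size=0.8, random_state=0)

best_stage = {}
for rate in (0.01, 0.1, 0.5):
    model = GradientBoostingRegressor(learning_rate=rate, n_estimators=300, random_state=0)
    model.fit(X_fit, y_fit)

    # Held-out RMSE after 1, 2, ..., 300 trees
    held_out_rmse = [np.sqrt(mean_squared_error(y_val, stage))
                     for stage in model.staged_predict(X_val)]

    # Number of trees with the lowest held-out RMSE, and that RMSE
    best_stage[rate] = (int(np.argmin(held_out_rmse)) + 1, float(np.min(held_out_rmse)))
    print(f'learning_rate = {rate:4.2f}: best n_estimators = {best_stage[rate][0]:3d}, '
          f'RMSE = {best_stage[rate][1]:.3f}')
```

Fitting once with a generous `n_estimators` and reading off the best stage is cheaper than refitting for every candidate value, although a cross-validated grid search over both parameters is the more thorough option.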

>### Assignment
>
> Optimize `learning_rate` and `n_estimators`. Then set the regression object with those parameters.

Again, we proceed to train our model with the optimized hyperparameters.

In [None]:
# Train the optimized regression model
gradient_boosting_regression.fit(X_train, y_train)

# Coefficient of determination (R^2) for the prediction
print(f'Training score = {gradient_boosting_regression.score(X_train,y_train):.3f}')
print(f'Testing  score = {gradient_boosting_regression.score(X_test,y_test):.3f}\n')

predicted_train = gradient_boosting_regression.predict(X_train)
predicted_test = gradient_boosting_regression.predict(X_test)

training_rmse = np.sqrt( mean_squared_error(y_train, predicted_train) )
testing_rmse = np.sqrt( mean_squared_error(y_test, predicted_test) )
    
print(f'Training RMSE = {training_rmse:.3f}')
print(f'Testing  RMSE = {testing_rmse:.3f}')

Finally, we can plot our model to see the results.

In [None]:
# create a series of sampling points to plot the model
points  = 1000

X_model = np.linspace(np.min(x), np.max(x), num=points)
X_model = X_model.reshape( (-1,1) )

y_model_predictions = gradient_boosting_regression.predict(X_model)
y_model_reference   = reference_function(X_model)

# Plot the dataset
fig,ax=plt.subplots( figsize=(16,8) )

ax.scatter(X_train, y_train, c=blue, label='Data')
ax.scatter(X_test, y_test, c=orange, label='Testing')

ax.plot(X_model, y_model_predictions, c=blue, lw=2, label='Model')
ax.plot(X_model, y_model_reference,   c='k', lw=4, label='Reference')

ax.set_title('Performance', fontsize=20)

ax.set_xlabel('x values', fontsize=18)
ax.set_ylabel('y values', fontsize=18)

ax.legend(loc='best', fontsize=18)

plt.show()

In [None]:
fig,ax=plt.subplots( figsize=(8,8) )

ax.scatter(y_test, predicted_test, c=orange, label='Testing')
ax.scatter(y_train, predicted_train, c=blue, label='Training')

ax.set_xlabel('Reference', fontsize=18)
ax.set_ylabel('Prediction', fontsize=18)

ax.legend(loc='best', fontsize=18)

plt.show()