<a href="https://colab.research.google.com/github/sp8rks/MaterialsInformatics/blob/main/worked_examples/random_forest/RF%20example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Random Forest Example
For this notebook I'll pull some data from the Materials Project. I'll use the legacy API (`pymatgen.ext.matproj.MPRester`) from my MyPymatgen virtual environment.

## Video 

https://www.youtube.com/watch?v=X6BXE3Bln5M&list=PLL0SWcFqypCl4lrzk1dMWwTUrzQZFt7y0&index=20 (Ensemble techniques)

Let's start by installing pymatgen and, if we're running on Google Colab, mounting Google Drive so we can reach the API key file.

In [1]:
!pip install pymatgen
from google.colab import drive
drive.mount('/content/drive/')
%cd /content/drive/My Drive/teaching/5540-6640 Materials Informatics


Now let's load our API key from a text file.

In [2]:
import pandas as pd
from pymatgen.ext.matproj import MPRester
import os
#if running locally
#filename = r'G:\My Drive\teaching\5540-6640 Materials Informatics\old_apikey.txt'
#if running google Colab
filename = r'old_apikey.txt'


def get_file_contents(filename):
    try:
        with open(filename, 'r') as f:
            # It's assumed our file contains a single line,
            # with our API key
            return f.read().strip()
    except FileNotFoundError:
        print("'%s' file not found" % filename)


Sparks_API = get_file_contents(filename)
mpr = MPRester(Sparks_API)
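
As an alternative to a key file, the key can also come from an environment variable. The snippet below is only a minimal sketch: the helper `load_api_key` and the variable name `MP_API_KEY` are made up for illustration and are not something pymatgen defines. Unlike `get_file_contents`, it raises an error when no key is found instead of silently returning `None`.

```python
import os
from pathlib import Path


def load_api_key(filename="old_apikey.txt", env_var="MP_API_KEY"):
    """Return the API key from an environment variable, falling back to a file.

    Both default names are illustrative assumptions.
    """
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    path = Path(filename)
    if path.is_file():
        # The file is assumed to hold the key on a single line
        key = path.read_text().strip()
    if not key:
        # Fail early instead of passing None on to the API client
        raise RuntimeError(f"No API key found in ${env_var} or '{filename}'")
    return key
```

The returned string could then be passed to `MPRester` in place of `Sparks_API`.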



Now let's grab some data to work with. We'll grab chlorides within 1 meV/atom of the convex hull.

In [None]:
df = pd.DataFrame(columns=('pretty_formula', 'band_gap',
                           "density", 'formation_energy_per_atom', 'volume'))

# grab some props for stable chlorides
criteria = {'e_above_hull': {'$lte': 0.001},'elements':{'$all':['Cl']}}
# criteria2 = {'e_above_hull': {'$lte': 0.02},'elements':{'$all':['O']},
#              'band_gap':{'$ne':0}}

props = ['pretty_formula', 'band_gap', "density",
         'formation_energy_per_atom', 'volume']
entries = mpr.query(criteria=criteria, properties=props)

i = 0
for entry in entries:
    df.loc[i] = [entry['pretty_formula'], entry['band_gap'], entry['density'],
                 entry['formation_energy_per_atom'], entry['volume']]
    i += 1
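
Filling a dataframe one row at a time with `df.loc[i]` gets slow for large queries. Assuming the query returns a list of dictionaries (as the loop above implies), the dataframe can be built in one call. This is a sketch with a stand-in `entries` list; the numbers are placeholders, not real Materials Project values.

```python
import pandas as pd

props = ['pretty_formula', 'band_gap', 'density',
         'formation_energy_per_atom', 'volume']

# Stand-in for the query result (placeholder values for illustration only)
entries = [
    {'pretty_formula': 'NaCl', 'band_gap': 1.0, 'density': 2.0,
     'formation_energy_per_atom': -1.0, 'volume': 40.0},
    {'pretty_formula': 'KCl', 'band_gap': 1.5, 'density': 1.8,
     'formation_energy_per_atom': -1.2, 'volume': 60.0},
]

# One call builds every row; `columns` fixes the column order
df_demo = pd.DataFrame(entries, columns=props)
print(df_demo)
```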

In [None]:
from sklearn.ensemble import RandomForestRegressor
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

RNG_SEED = 42
np.random.seed(seed=RNG_SEED)

X = df[['band_gap','formation_energy_per_atom','volume']]
y = df['density']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=RNG_SEED)
rf = RandomForestRegressor(max_depth=2, random_state=0)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
r2 = r2_score(y_test, y_pred)
print('the r2 score is',r2)
mae = mean_absolute_error(y_test, y_pred)
print('the mean absolute error is',mae)
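# mean_squared_error returns the MSE, so take the square root to get the RMSE
rmse_val = np.sqrt(mean_squared_error(y_test, y_pred))
print('the root mean squared error is',rmse_val)
df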
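
To make the three metrics concrete, here is a small worked example that computes them by hand. The numbers are invented for illustration. Note that the RMSE is the square root of the mean squared error.

```python
import math

# Made-up true values and predictions
y_true = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.5, 5.0, 8.0, 9.5]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_hat)]

# Mean absolute error: average size of the errors
mae_demo = sum(abs(e) for e in errors) / n
# Mean squared error and its square root (RMSE)
mse_demo = sum(e ** 2 for e in errors) / n
rmse_demo = math.sqrt(mse_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2_demo = 1 - ss_res / ss_tot

print("MAE:", mae_demo)    # 0.5
print("MSE:", mse_demo)    # 0.375
print("RMSE:", rmse_demo)  # about 0.612
print("R2:", r2_demo)      # 0.925
```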

Our model isn't too great alone, but what if we add composition-based feature vector (CBFV) features?

In [None]:
from CBFV import composition
import time

# CBFV expects the columns 'formula' and 'target'
rename_dict = {'density': 'target', 'pretty_formula':'formula'}
df = df.rename(columns=rename_dict)


RNG_SEED = 42
np.random.seed(seed=RNG_SEED)

# Split the rows first so the test set stays unseen, then featurize each part separately.
# With extend_features=True the other columns (band_gap, formation_energy_per_atom, volume) are kept as features.
train_df, test_df = train_test_split(df, test_size=0.33, random_state=RNG_SEED)
train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)

X_train, y_train, formulae_train, skipped_train = composition.generate_features(train_df, elem_prop='oliynyk', drop_duplicates=False, extend_features=True, sum_feat=True)
X_test, y_test, formulae_test, skipped_test = composition.generate_features(test_df, elem_prop='oliynyk', drop_duplicates=False, extend_features=True, sum_feat=True)


#technically we should scale and normalize our data here... but lets skip it for now
rf = RandomForestRegressor(max_depth=4, random_state=0)

# Start the timer
start_time = time.time()

rf.fit(X_train, y_train)

# Calculate the training time (after the fit has finished)
training_time = time.time() - start_time

y_pred = rf.predict(X_test)
r2 = r2_score(y_test, y_pred)
print('the r2 score is',r2)
mae = mean_absolute_error(y_test, y_pred)
print('the mean absolute error is',mae)
# mean_squared_error returns the MSE, so take the square root to get the RMSE
rmse_val = np.sqrt(mean_squared_error(y_test, y_pred))
print('the root mean squared error is',rmse_val)
print("Training time:", training_time, "seconds")


Way better! Our R^2 went way up, and our MAE went way down.
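
To give a feel for what a composition-based feature vector is, here is a hand-rolled sketch that turns a formula into a few statistics of one elemental property. It is not the CBFV package's implementation: the helper names and the tiny property table are assumptions for illustration, and the parser only handles simple formulas without parentheses.

```python
import re

# Approximate standard atomic masses in g/mol for a few elements
ATOMIC_MASS = {"Na": 22.990, "Mg": 24.305, "Cl": 35.45, "K": 39.098}


def parse_formula(formula):
    """Parse a simple formula such as 'MgCl2' into {'Mg': 1.0, 'Cl': 2.0}."""
    counts = {}
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?", formula):
        counts[symbol] = counts.get(symbol, 0.0) + (float(amount) if amount else 1.0)
    return counts


def featurize(formula, prop=ATOMIC_MASS, name="mass"):
    """Return composition-weighted statistics of one elemental property."""
    counts = parse_formula(formula)
    total = sum(counts.values())
    values = [prop[el] for el in counts]
    # Average weighted by the atomic fraction of each element
    avg = sum(prop[el] * n / total for el, n in counts.items())
    return {
        f"avg_{name}": avg,
        f"min_{name}": min(values),
        f"max_{name}": max(values),
        f"range_{name}": max(values) - min(values),
    }


print(featurize("MgCl2"))
```

Repeating this for many elemental properties gives each formula a long numeric vector that a model like random forest can learn from.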

# Grid Search Hyperparameter Tuning

Now let's try one more time, but this time we'll do hyperparameter tuning!
Note: we're going to reduce our data down to just 300 points during hyperparameter tuning, or it will take foreeeeeeevvvvveeerrr.

In [None]:
from sklearn.model_selection import GridSearchCV
import time

# Select a subset of the dataframe with 300 data points
RNG_SEED = 42
np.random.seed(seed=RNG_SEED)
subset_df = df.sample(n=300, random_state=RNG_SEED)

# Make sure the columns have the names CBFV expects (a no-op if df was already renamed)
rename_dict = {'density': 'target', 'pretty_formula':'formula'}
subset_df = subset_df.rename(columns=rename_dict)

# Split the subset into training and testing sets before featurizing
train_subset, test_subset = train_test_split(subset_df, test_size=0.33, random_state=RNG_SEED)
train_subset = train_subset.reset_index(drop=True)
test_subset = test_subset.reset_index(drop=True)

#now do CBFV on each part separately
X_train, y_train, formulae_train, skipped_train = composition.generate_features(train_subset, elem_prop='oliynyk', drop_duplicates=False, extend_features=True, sum_feat=True)
X_test, y_test, formulae_test, skipped_test = composition.generate_features(test_subset, elem_prop='oliynyk', drop_duplicates=False, extend_features=True, sum_feat=True)


# Define the parameter grid for the grid search
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5],
    'min_samples_split': [2, 4, 6],
    'min_samples_leaf': [1, 2, 3]
}

# Create the random forest regressor
rf_subset = RandomForestRegressor(random_state=RNG_SEED)

# Create the GridSearchCV object
grid_search_subset = GridSearchCV(estimator=rf_subset, param_grid=param_grid, cv=5)

# Start the timer
start_time_subset = time.time()

# Fit the model to the training data
grid_search_subset.fit(X_train, y_train)

# Calculate the training time
training_time_subset = time.time() - start_time_subset

# Get the best parameters and best score
best_params_subset = grid_search_subset.best_params_
best_score_subset = grid_search_subset.best_score_

# Train the model with the best parameters
rf_best_subset_grid = RandomForestRegressor(random_state=RNG_SEED, **best_params_subset)
rf_best_subset_grid.fit(X_train, y_train)

# Predict on the test data
y_pred_subset = rf_best_subset_grid.predict(X_test)

# Evaluate the model
r2_subset = r2_score(y_test, y_pred_subset)
mae_subset = mean_absolute_error(y_test, y_pred_subset)
rmse_subset = np.sqrt(mean_squared_error(y_test, y_pred_subset))  # square root of the MSE

print("Best parameters (subset):", best_params_subset)
print("Best score (subset):", best_score_subset)
print("R2 score (subset):", r2_subset)
print("Mean absolute error (subset):", mae_subset)
print("Root mean squared error (subset):", rmse_subset)
print("Training time (subset):", training_time_subset, "seconds")
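
A quick count shows why grid search gets expensive. The sketch below enumerates the same parameter grid with the standard library and multiplies by the number of cross-validation folds. The refit on the best parameters is not counted.

```python
from itertools import product

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5],
    'min_samples_split': [2, 4, 6],
    'min_samples_leaf': [1, 2, 3]
}

# Every combination of one value per hyperparameter
combos = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

n_folds = 5  # matches cv=5
n_fits = len(combos) * n_folds
print(len(combos), "combinations x", n_folds, "folds =", n_fits, "fits")
# prints: 81 combinations x 5 folds = 405 fits
```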


# Random search hyperparameter tuning
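Now let's try random search hyperparameter tuning. Instead of trying every combination, `RandomizedSearchCV` samples `n_iter` combinations from the grid.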

In [None]:
from sklearn.model_selection import RandomizedSearchCV
import time

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5],
    'min_samples_split': [2, 4, 6],
    'min_samples_leaf': [1, 2, 3]
}

# Create the random forest regressor
rf_subset = RandomForestRegressor(random_state=0)

# Create the RandomizedSearchCV object
random_search = RandomizedSearchCV(estimator=rf_subset, param_distributions=param_grid, n_iter=10, cv=5, random_state=0)

# Start the timer
start_time_subset = time.time()

# Fit the randomized search (not the earlier grid search) to the training data
random_search.fit(X_train, y_train)

# Calculate the training time
training_time_subset = time.time() - start_time_subset

# Get the best parameters and best score
best_params_subset = random_search.best_params_
best_score_subset = random_search.best_score_

# Train the model with the best parameters
rf_best_subset_random = RandomForestRegressor(random_state=RNG_SEED, **best_params_subset)
rf_best_subset_random.fit(X_train, y_train)

# Predict on the test data
y_pred_subset = rf_best_subset_random.predict(X_test)

# Evaluate the model
r2_subset = r2_score(y_test, y_pred_subset)
mae_subset = mean_absolute_error(y_test, y_pred_subset)
rmse_subset = np.sqrt(mean_squared_error(y_test, y_pred_subset))  # square root of the MSE

print("Best parameters (subset):", best_params_subset)
print("Best score (subset):", best_score_subset)
print("R2 score (subset):", r2_subset)
print("Mean absolute error (subset):", mae_subset)
print("Root mean squared error (subset):", rmse_subset)
print("Training time (subset):", training_time_subset, "seconds")
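
To see which candidates a randomized search would draw, scikit-learn's `ParameterSampler` can be run on its own. This is a sketch of the sampling step only. No model is trained here, and the draws need not match those made inside `RandomizedSearchCV`.

```python
from sklearn.model_selection import ParameterSampler

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5],
    'min_samples_split': [2, 4, 6],
    'min_samples_leaf': [1, 2, 3]
}

# Draw 10 combinations; with list-valued parameters the draws are made without replacement
sampled = list(ParameterSampler(param_grid, n_iter=10, random_state=0))
for params in sampled:
    print(params)
```

With 5-fold cross-validation, 10 sampled candidates mean 50 fits, at the cost of possibly missing the best combination in the grid.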


In [None]:
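#featurize with all the data, not just the subset
# Split first so the test rows are never used for training
train_df, test_df = train_test_split(df, test_size=0.33, random_state=RNG_SEED)
X_train, y_train, formulae_train, skipped_train = composition.generate_features(train_df.reset_index(drop=True), elem_prop='oliynyk', drop_duplicates=False, extend_features=True, sum_feat=True)
X_test, y_test, formulae_test, skipped_test = composition.generate_features(test_df.reset_index(drop=True), elem_prop='oliynyk', drop_duplicates=False, extend_features=True, sum_feat=True)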

# Start the timer
start_time = time.time()

# Train the model with the best parameters
rf_best_subset_random.fit(X_train, y_train)

# Calculate the training time
training_time = time.time() - start_time

# Predict on the test data
y_pred = rf_best_subset_random.predict(X_test)

# Evaluate the model
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # square root of the MSE

print("R2 score:", r2)
print("Mean absolute error:", mae)
print("Root mean squared error:", rmse)
print("Training time:", training_time, "seconds")
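
A side note on evaluation: these scores are only meaningful when `X_test` holds rows the model never saw during training. The sketch below uses synthetic data (not our chlorides) to show how much a random forest's R^2 can differ between rows it was trained on and held-out rows.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data: one informative feature plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = X_demo[:, 0] + rng.normal(scale=1.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

r2_seen = model.score(X_tr, y_tr)    # R^2 on rows used for training
r2_unseen = model.score(X_te, y_te)  # R^2 on held-out rows
print("R2 on training rows:", r2_seen)
print("R2 on held-out rows:", r2_unseen)
```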


# Tree visualization
Finally, we can visualize one of the trees in the forest with graphviz. Writing the PNG requires the Graphviz `dot` executable to be installed.

In [None]:
# Import tools needed for visualization
from sklearn.tree import export_graphviz
import pydot
# Pull out one tree from the forest
tree = rf.estimators_[5]
# Export the image to a dot file
feature_list = list(X_train.columns)
export_graphviz(tree, out_file = 'tree.dot', feature_names = feature_list, rounded = True, precision = 1)
# Use dot file to create a graph
(graph, ) = pydot.graph_from_dot_file('tree.dot')
# Write graph to a png file
graph.write_png('tree.png')
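
If Graphviz is not installed, scikit-learn can also print a tree as plain text with `export_text`. This is a sketch on synthetic data with made-up feature names, not the chloride model above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_text

# Synthetic data with two features (names are placeholders)
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(100, 2))
y_demo = 3.0 * X_demo[:, 0] + X_demo[:, 1]

forest = RandomForestRegressor(n_estimators=10, max_depth=2, random_state=0).fit(X_demo, y_demo)

# Pull out one tree from the forest and print its split rules
tree_text = export_text(forest.estimators_[5], feature_names=["feat_a", "feat_b"])
print(tree_text)
```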

# XGBoost 

#### Boosting vs Bagging 

XGBoost has emerged as one of the most popular and best-performing classical machine learning models. It differs from random forest in how the trees are built:
- Random Forest uses a technique called “bagging,” which builds multiple decision trees independently on bootstrap samples of the data and then averages their predictions to reduce variance and improve robustness.

- XGBoost, on the other hand, uses “boosting,” which builds trees sequentially, each new tree correcting the errors of the previous ones. This process reduces bias and improves accuracy by focusing on the more difficult parts of the data.

#### Gradient Boosting

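- Each new tree is fit to the gradient of a loss function with respect to the current predictions, so the ensemble takes a gradient descent step with every tree it adds. Random Forest trees are instead grown independently, with no tree correcting another.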

#### Overfitting

- XGBoost has built-in regularization (L1 and L2 penalties on the leaf weights, plus a penalty on tree complexity), which helps provide robustness against overfitting.
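
The difference between bagging and boosting can be sketched in a few lines with scikit-learn decision trees on synthetic data. This is plain gradient boosting with a squared-error loss, where the negative gradient is simply the residual. It is not XGBoost's actual algorithm, which also uses second-order information and regularization.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1D regression problem
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

# Bagging: fit each tree independently on a bootstrap sample, then average
bagged_preds = []
for i in range(20):
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    tree = DecisionTreeRegressor(max_depth=2, random_state=i).fit(X_demo[idx], y_demo[idx])
    bagged_preds.append(tree.predict(X_demo))
bagged = np.mean(bagged_preds, axis=0)

# Boosting: fit each tree to the residuals of the current ensemble
learning_rate = 0.3
boosted = np.full(len(y_demo), y_demo.mean())
boost_mse = [np.mean((y_demo - boosted) ** 2)]
for i in range(20):
    residuals = y_demo - boosted  # negative gradient of the squared-error loss
    tree = DecisionTreeRegressor(max_depth=2, random_state=i).fit(X_demo, residuals)
    boosted += learning_rate * tree.predict(X_demo)
    boost_mse.append(np.mean((y_demo - boosted) ** 2))

print("bagged training MSE: ", np.mean((y_demo - bagged) ** 2))
print("boosted training MSE:", boost_mse[-1])
```

The list `boost_mse` tracks how the training error shrinks as each tree corrects what the previous ones got wrong.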


### Training

Let's train the model on the same dataset as before (for comparison's sake).


In [None]:
X_train.describe()

First we will import our libraries. We will be using the XGBoost library for our model. 

In [None]:
import xgboost as xgb
import time
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

Before we fit our model we need to clean the feature names, because XGBoost does not accept column names containing `[`, `]`, or `<`.

In [None]:

# Clean feature names by replacing invalid characters
X_train.columns = [col.replace('[', '').replace(']', '').replace('<', '').replace('>', '') for col in X_train.columns]
X_test.columns = [col.replace('[', '').replace(']', '').replace('<', '').replace('>', '') for col in X_test.columns]
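
The same cleanup can be written once as a small helper. This is a sketch: the function name and the example column names are made up for illustration. It also checks that stripping characters did not make two names collide.

```python
import re


def clean_feature_names(names):
    """Strip the characters [, ], < and > from each feature name."""
    cleaned = [re.sub(r"[\[\]<>]", "", str(name)) for name in names]
    if len(set(cleaned)) != len(cleaned):
        # Two different names became identical after cleaning
        raise ValueError("Cleaning produced duplicate feature names")
    return cleaned


print(clean_feature_names(["avg_prop[1]", "max_<x>", "volume"]))
# prints: ['avg_prop1', 'max_x', 'volume']
```

It could be applied as `X_train.columns = clean_feature_names(X_train.columns)`, and likewise for `X_test`.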


Now we can train our model. Some of the parameters you can change include `n_estimators` (the number of trees the model will train), `learning_rate` (how much each tree's contribution is scaled down before it is added to the model), and `max_depth` (how deep a tree can grow; deeper trees are more prone to overfitting).

In [None]:

# Initialize and train the XGBoost model
xg_reg = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=100, random_state=42, learning_rate=0.3, max_depth=6)

# Start timer
start_time = time.time()

xg_reg.fit(X_train, y_train)

# Calculate training time
training_time = time.time() - start_time

# Predict on test data
y_pred = xg_reg.predict(X_test)


Finally, let's evaluate this model. We will calculate the R^2, Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE) values to see how it did. Furthermore, we will put the feature importance scores into a dataframe and print the top 10 features. The `weight` importance used here counts how often a feature is used to split the data across all trees. XGBoost is very useful for discovering which features contribute the most to making good predictions, which helps when you need to pare down the number of features in your dataset.

In [None]:

# Evaluation
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # square root of the MSE

print("R2 score:", r2)
print("Mean absolute error:", mae)
print("Root mean squared error:", rmse)
print("Training time:", training_time, "seconds")

# Get feature importance scores from the XGBoost model
importance_scores = xg_reg.get_booster().get_score(importance_type='weight')

# Convert the dictionary to a DataFrame
importance_df = pd.DataFrame(importance_scores.items(), columns=['Feature', 'Importance'])

# Sort the DataFrame by importance in descending order
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Display the sorted DataFrame
importance_df.head(10)
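
Random forests in scikit-learn expose a comparable ranking through the `feature_importances_` attribute. The sketch below uses synthetic data with made-up feature names, where only the first feature carries signal, so it should come out on top.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: only the first feature influences the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = 3.0 * X_demo[:, 0] + 0.1 * rng.normal(size=300)
names = ["strong", "noise_1", "noise_2"]

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances sum to 1; sort them from most to least important
ranking = sorted(zip(names, forest.feature_importances_), key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(name, round(score, 3))
```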


# Try it yourself!

- Find oxides with MPRester and load their formula and band gap energy
- Create a CBFV with this data for use in training models 
- Create a RF model using this data and score it 
- Create a single decision tree model (using sklearn DecisionTreeRegressor, since band gap is a continuous target) using this data and score it
- How do the scores of the two models compare? Why? 
- Extract the feature importance from the RF model and see which features matter the most

In [None]:
# Code below
