## Random Forest Example
For this notebook I'll be pulling some data from Materials Project. I'll use the old api using my MyPymatgen virtual environment

Let's start by getting our API key loaded.

In [10]:
import pandas as pd
from pymatgen.ext.matproj import MPRester
import os

filename = r'G:\My Drive\teaching\5540-6640 Materials Informatics\old_apikey.txt'

def get_file_contents(filename):
    try:
        with open(filename, 'r') as f:
            # It's assumed our file contains a single line,
            # with our API key
            return f.read().strip()
    except FileNotFoundError:
        print("'%s' file not found" % filename)


Sparks_API = get_file_contents(filename)
mpr = MPRester(Sparks_API)



Now let's grab some data to work with. We'll grab chlorides within 1 meV of the convex hull.

In [11]:
df = pd.DataFrame(columns=('pretty_formula', 'band_gap',
                           "density", 'formation_energy_per_atom', 'volume'))

# grab some props for stable chlorides
criteria = {'e_above_hull': {'$lte': 0.001},'elements':{'$all':['Cl']}}
# criteria2 = {'e_above_hull': {'$lte': 0.02},'elements':{'$all':['O']},
#              'band_gap':{'$ne':0}}

props = ['pretty_formula', 'band_gap', "density",
         'formation_energy_per_atom', 'volume']
entries = mpr.query(criteria=criteria, properties=props)

i = 0
for entry in entries:
    df.loc[i] = [entry['pretty_formula'], entry['band_gap'], entry['density'],
                 entry['formation_energy_per_atom'], entry['volume']]
    i += 1

100%|██████████| 10611/10611 [00:06<00:00, 1651.05it/s]


In [12]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_classification
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

RNG_SEED = 42
np.random.seed(seed=RNG_SEED)

X = df[['band_gap','formation_energy_per_atom','volume']]
y = df['density']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=RNG_SEED)
rf = RandomForestRegressor(max_depth=2, random_state=0)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
r2 = r2_score(y_test, y_pred)
print('the r2 score is',r2)
mae = mean_absolute_error(y_test, y_pred)
print('the mean absolute error is',mae)
rmse_val = mean_squared_error(y_test, y_pred, squared=False)
df

the r2 score is 0.2881470805614287
the mean absolute error is 1.2284514037968903




Unnamed: 0,pretty_formula,band_gap,density,formation_energy_per_atom,volume
0,BaCaB2O5,4.4757,3.617093,-3.216147,1024.757146
1,BaNaLi3(BO2)6,5.2185,2.902622,-2.854216,501.142507
2,CsCa10(PO4)7,5.2036,3.221387,-3.294669,1235.573909
3,EuZn(BO2)5,0.0000,4.474836,-2.824396,640.374866
4,BaEu2O4,0.0000,7.087849,-3.089818,473.481665
...,...,...,...,...,...
10606,Li6Mn5Fe(BO3)6,2.6100,3.190292,-2.510023,377.379651
10607,Rb4PbO4,1.6016,4.577572,-1.533193,889.576219
10608,Li6Mn5Fe(BO3)6,2.6056,3.190778,-2.510041,377.322178
10609,SrTeO3,3.6197,4.907403,-2.387443,356.265104


Our model isn't too great alone, but what if we add CBFV features? 

In [13]:
from CBFV import composition
import time

rename_dict = {'density': 'target', 'pretty_formula':'formula'}
df = df.rename(columns=rename_dict)


RNG_SEED = 42
np.random.seed(seed=RNG_SEED)

X = df[['formula','band_gap','formation_energy_per_atom','volume']]
y = df['target']



X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=RNG_SEED)

X_train, y_train, formulae_train, skipped_train = composition.generate_features(df, elem_prop='oliynyk', drop_duplicates=False, extend_features=True, sum_feat=True)
X_test, y_test, formulae_train, skipped_train = composition.generate_features(df, elem_prop='oliynyk', drop_duplicates=False, extend_features=True, sum_feat=True)


#technically we should scale and normalize our data here... but lets skip it for now
# Start the timer
start_time = time.time()

# Calculate the training time
training_time = time.time() - start_time

rf = RandomForestRegressor(max_depth=4, random_state=0)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
r2 = r2_score(y_test, y_pred)
print('the r2 score is',r2)
mae = mean_absolute_error(y_test, y_pred)
print('the mean absolute error is',mae)
rmse_val = mean_squared_error(y_test, y_pred, squared=False)
print("Training time:", training_time, "seconds")


Processing Input Data:   0%|          | 0/10611 [00:00<?, ?it/s]

Processing Input Data: 100%|██████████| 10611/10611 [00:01<00:00, 5967.15it/s]


	Featurizing Compositions...


Assigning Features...: 100%|██████████| 10611/10611 [00:02<00:00, 4422.16it/s]



NOTE: Your data contains formula with exotic elements. These were skipped.
	Creating Pandas Objects...


Processing Input Data: 100%|██████████| 10611/10611 [00:02<00:00, 4693.48it/s]


	Featurizing Compositions...


Assigning Features...: 100%|██████████| 10611/10611 [00:01<00:00, 5466.37it/s]



NOTE: Your data contains formula with exotic elements. These were skipped.
	Creating Pandas Objects...
the r2 score is 0.8654756838114049
the mean absolute error is 0.4925598130646727
Training time: 0.0 seconds




Way better! Our R^2 went way up, and our MAE went way down.

# Grid Search Hyperparameter Tuning

Now let's try one more time, but this time we'll do hyperparameter tuning!

In [14]:
from sklearn.model_selection import GridSearchCV
import time
# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5],
    'min_samples_split': [2, 4, 6],
    'min_samples_leaf': [1, 2, 3]
}

# Create the random forest regressor
rf = RandomForestRegressor(random_state=0)

# Create the GridSearchCV object
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5)

# Start the timer
start_time = time.time()

# Fit the model to the training data
grid_search.fit(X_train, y_train)

# Calculate the training time
training_time = time.time() - start_time

# Get the best parameters and best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

# Train the model with the best parameters
rf_best = RandomForestRegressor(random_state=0, **best_params)
rf_best.fit(X_train, y_train)

# Predict on the test data
y_pred = rf_best.predict(X_test)

# Evaluate the model
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred, squared=False)

print("Best parameters:", best_params)
print("Best score:", best_score)
print("R2 score:", r2)
print("Mean absolute error:", mae)
print("Root mean squared error:", rmse)
print("Training time:", training_time, "seconds")


KeyboardInterrupt: 

# Random search hyperparameter tuning
Now let's try random search hyperparameter tuning

In [None]:
from sklearn.model_selection import RandomizedSearchCV
import time

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5],
    'min_samples_split': [2, 4, 6],
    'min_samples_leaf': [1, 2, 3]
}

# Create the random forest regressor
rf = RandomForestRegressor(random_state=0)

# Create the RandomizedSearchCV object
random_search = RandomizedSearchCV(estimator=rf, param_distributions=param_grid, n_iter=10, cv=5, random_state=0)

# Start the timer
start_time = time.time()

# Fit the model to the training data
random_search.fit(X_train, y_train)

# Calculate the training time
training_time = time.time() - start_time

# Get the best parameters and best score
best_params = random_search.best_params_
best_score = random_search.best_score_

# Train the model with the best parameters
rf_best = RandomForestRegressor(random_state=0, **best_params)
rf_best.fit(X_train, y_train)

# Predict on the test data
y_pred = rf_best.predict(X_test)

# Evaluate the model
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred, squared=False)

print("Best parameters:", best_params)
print("Best score:", best_score)
print("R2 score:", r2)
print("Mean absolute error:", mae)
print("Root mean squared error:", rmse)
print("Training time:", training_time, "seconds")


# Tree visualization
Finally, we can do tree visualization with graphviz

In [None]:
# Import tools needed for visualization
from sklearn.tree import export_graphviz
import pydot
# Pull out one tree from the forest
tree = rf.estimators_[5]
# Import tools needed for visualization
from sklearn.tree import export_graphviz
import pydot
# Pull out one tree from the forest
tree = rf.estimators_[5]
# Export the image to a dot file
feature_list = list(X_train.columns)
export_graphviz(tree, out_file = 'tree.dot', feature_names = feature_list, rounded = True, precision = 1)
# Use dot file to create a graph
(graph, ) = pydot.graph_from_dot_file('tree.dot')
# Write graph to a png file
graph.write_png('tree.png')

# XGBoost and others

There are many other variants of random forests. Adaboost, XGBoost etc. XGBoost, in particular, has emerged as one of the favorite and best classical models