## Random Forest Example
For this notebook I'll be pulling some data from the Materials Project. I'll use the legacy API (`pymatgen.ext.matproj.MPRester`) from my MyPymatgen virtual environment.

Let's start by getting our API key loaded.

In [None]:
import pandas as pd
from pymatgen.ext.matproj import MPRester
import os

filename = r'G:\My Drive\teaching\5540-6640 Materials Informatics\old_apikey.txt'

def get_file_contents(filename):
    try:
        with open(filename, 'r') as f:
            # It's assumed our file contains a single line,
            # with our API key
            return f.read().strip()
    except FileNotFoundError:
        print("'%s' file not found" % filename)


Sparks_API = get_file_contents(filename)
mpr = MPRester(Sparks_API)

Now let's grab some data to work with. We'll grab chlorides (compounds containing Cl) within 1 meV/atom of the convex hull.

In [None]:
# grab some props for stable chlorides
criteria = {'e_above_hull': {'$lte': 0.001}, 'elements': {'$all': ['Cl']}}
# criteria2 = {'e_above_hull': {'$lte': 0.02},'elements':{'$all':['O']},
#              'band_gap':{'$ne':0}}

props = ['pretty_formula', 'band_gap', 'density',
         'formation_energy_per_atom', 'volume']
entries = mpr.query(criteria=criteria, properties=props)

# mpr.query returns a list of dicts (one per material), so the DataFrame
# can be built in one step instead of appending row by row
df = pd.DataFrame(entries, columns=props)

In [None]:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

RNG_SEED = 42
np.random.seed(seed=RNG_SEED)

X = df[['band_gap','formation_energy_per_atom','volume']]
y = df['density']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=RNG_SEED)
rf = RandomForestRegressor(max_depth=2, random_state=0)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
r2 = r2_score(y_test, y_pred)
print('the r2 score is',r2)
mae = mean_absolute_error(y_test, y_pred)
print('the mean absolute error is',mae)
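# take the square root ourselves; the `squared=False` shortcut
# is not available in every scikit-learn version
rmse_val = np.sqrt(mean_squared_error(y_test, y_pred))
print('the root mean squared error is', rmse_val)
df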

Our model isn't too great alone, but what if we add CBFV features? 
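Before running CBFV, it may help to see what a composition-based feature vector is. The sketch below is not CBFV's implementation: it assumes a tiny hand-typed property table (`ELEM_PROPS`, approximate values) and a hypothetical `featurize` helper that turns a formula into fraction-weighted average, max, and min statistics. The parser only handles simple formulas without parentheses.

```python
import re

# Illustrative mini-table with approximate values (an assumption for this
# sketch; the real 'oliynyk' table in CBFV has many more properties)
ELEM_PROPS = {
    "Na": {"atomic_mass": 22.99, "electronegativity": 0.93},
    "Mg": {"atomic_mass": 24.305, "electronegativity": 1.31},
    "Cl": {"atomic_mass": 35.45, "electronegativity": 3.16},
}


def parse_formula(formula):
    """Split a simple formula like 'MgCl2' into {'Mg': 1.0, 'Cl': 2.0}."""
    counts = {}
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        counts[symbol] = counts.get(symbol, 0.0) + (float(amount) if amount else 1.0)
    return counts


def featurize(formula):
    """Hypothetical helper: per-property statistics over the elements present."""
    counts = parse_formula(formula)
    total = sum(counts.values())
    features = {}
    for prop in ("atomic_mass", "electronegativity"):
        values = [ELEM_PROPS[el][prop] for el in counts]
        fractions = [counts[el] / total for el in counts]
        # average weighted by the atomic fraction of each element
        features["avg_" + prop] = sum(f * v for f, v in zip(fractions, values))
        features["max_" + prop] = max(values)
        features["min_" + prop] = min(values)
    return features


nacl = featurize("NaCl")
mgcl2 = featurize("MgCl2")
print(nacl)
print(mgcl2)
```

Every formula maps to a vector of the same length, whatever elements it contains, which is what lets a model like a random forest learn from chemical formulas.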

In [None]:
from CBFV import composition
import time

rename_dict = {'density': 'target', 'pretty_formula':'formula'}
df = df.rename(columns=rename_dict)


RNG_SEED = 42
np.random.seed(seed=RNG_SEED)

# Split the rows first, then featurize each split on its own,
# so the model is scored on compounds it never saw during training
train_df, test_df = train_test_split(df, test_size=0.33, random_state=RNG_SEED)
train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)

X_train, y_train, formulae_train, skipped_train = composition.generate_features(train_df, elem_prop='oliynyk', drop_duplicates=False, extend_features=True, sum_feat=True)
X_test, y_test, formulae_test, skipped_test = composition.generate_features(test_df, elem_prop='oliynyk', drop_duplicates=False, extend_features=True, sum_feat=True)

#technically we should scale and normalize our data here... but let's skip it for now
rf = RandomForestRegressor(max_depth=4, random_state=0)

# Start the timer
start_time = time.time()
rf.fit(X_train, y_train)
# Calculate the training time
training_time = time.time() - start_time

y_pred = rf.predict(X_test)
r2 = r2_score(y_test, y_pred)
print('the r2 score is',r2)
mae = mean_absolute_error(y_test, y_pred)
print('the mean absolute error is',mae)
rmse_val = np.sqrt(mean_squared_error(y_test, y_pred))
print('the root mean squared error is',rmse_val)
print("Training time:", training_time, "seconds")


Way better! Our R^2 went way up, and our MAE went way down.
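For reference, here is a minimal sketch of what these scores measure, written in plain Python rather than with scikit-learn. The helper name `regression_metrics` and the four density values are made up for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (R^2, MAE, RMSE) for two equal-length lists of numbers."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # squared error of the model vs. squared error of always predicting the mean
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(ss_res / n)
    return r2, mae, rmse


# Made-up densities (g/cm^3) purely for illustration
y_true_demo = [2.0, 3.0, 4.0, 5.0]
y_pred_demo = [2.5, 3.0, 3.5, 5.0]
r2_demo, mae_demo, rmse_demo = regression_metrics(y_true_demo, y_pred_demo)
print(r2_demo, mae_demo, rmse_demo)
```

An R^2 of 1 means perfect predictions and 0 means no better than predicting the mean, while MAE and RMSE are in the same units as the target (lower is better).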

# Grid Search Hyperparameter Tuning

Now let's try one more time, but this time we'll do hyperparameter tuning!
Note: we're going to reduce our data down to just 300 points during hyperparameter tuning or it will take foreeeeeeevvvvveeerrr.
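To get a feel for why the search is slow, this small sketch counts the model fits for a grid with three values for each of four hyperparameters under 5-fold cross-validation (the same shape of grid used in this section). It only uses `itertools`, so it illustrates the bookkeeping and is not how `GridSearchCV` is implemented.

```python
from itertools import product

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5],
    'min_samples_split': [2, 4, 6],
    'min_samples_leaf': [1, 2, 3],
}
cv_folds = 5

# Every combination of one value per hyperparameter
combinations = [dict(zip(param_grid, values))
                for values in product(*param_grid.values())]

# Each combination is trained once per cross-validation fold
n_fits = len(combinations) * cv_folds
print(len(combinations), "combinations x", cv_folds, "folds =", n_fits, "fits")
```

Each of those fits trains a whole forest of 100 to 300 trees, which is why shrinking the dataset helps so much.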

In [None]:
from sklearn.model_selection import GridSearchCV
import time

# Select a subset of the dataframe with 300 data points
# (df already has the 'formula' and 'target' columns from the rename above)
RNG_SEED = 42
np.random.seed(seed=RNG_SEED)
subset_df = df.sample(n=300, random_state=RNG_SEED)

# Split the subset into training and testing sets before featurizing,
# so the test rows stay unseen during tuning
train_subset_df, test_subset_df = train_test_split(subset_df, test_size=0.33, random_state=RNG_SEED)
train_subset_df = train_subset_df.reset_index(drop=True)
test_subset_df = test_subset_df.reset_index(drop=True)

#now do CBFV
X_train, y_train, formulae_train, skipped_train = composition.generate_features(train_subset_df, elem_prop='oliynyk', drop_duplicates=False, extend_features=True, sum_feat=True)
X_test, y_test, formulae_test, skipped_test = composition.generate_features(test_subset_df, elem_prop='oliynyk', drop_duplicates=False, extend_features=True, sum_feat=True)


# Define the parameter grid for the grid search
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5],
    'min_samples_split': [2, 4, 6],
    'min_samples_leaf': [1, 2, 3]
}

# Create the random forest regressor
rf_subset = RandomForestRegressor(random_state=RNG_SEED)

# Create the GridSearchCV object
grid_search_subset = GridSearchCV(estimator=rf_subset, param_grid=param_grid, cv=5)

# Start the timer
start_time_subset = time.time()

# Fit the model to the training data
grid_search_subset.fit(X_train, y_train)

# Calculate the training time
training_time_subset = time.time() - start_time_subset

# Get the best parameters and best score
best_params_subset = grid_search_subset.best_params_
best_score_subset = grid_search_subset.best_score_

# Train the model with the best parameters
rf_best_subset_grid = RandomForestRegressor(random_state=RNG_SEED, **best_params_subset)
rf_best_subset_grid.fit(X_train, y_train)

# Predict on the test data
y_pred_subset = rf_best_subset_grid.predict(X_test)

# Evaluate the model
r2_subset = r2_score(y_test, y_pred_subset)
mae_subset = mean_absolute_error(y_test, y_pred_subset)
rmse_subset = np.sqrt(mean_squared_error(y_test, y_pred_subset))

print("Best parameters (subset):", best_params_subset)
print("Best score (subset):", best_score_subset)
print("R2 score (subset):", r2_subset)
print("Mean absolute error (subset):", mae_subset)
print("Root mean squared error (subset):", rmse_subset)
print("Training time (subset):", training_time_subset, "seconds")


# Random search hyperparameter tuning
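Now let's try random search hyperparameter tuning. Instead of fitting every combination in the grid, it samples a fixed number of combinations (`n_iter`) and cross-validates only those.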
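As a rough illustration of the savings, the sketch below draws 10 distinct combinations from a grid of 81 and counts the resulting fits under 5-fold cross-validation. It mimics the idea with the standard library's `random.sample`; it is not scikit-learn's sampler, and the seed is arbitrary.

```python
import random
from itertools import product

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5],
    'min_samples_split': [2, 4, 6],
    'min_samples_leaf': [1, 2, 3],
}
n_iter = 10
cv_folds = 5

all_combinations = [dict(zip(param_grid, values))
                    for values in product(*param_grid.values())]

# Draw n_iter distinct combinations instead of trying all of them
rng = random.Random(0)
sampled = rng.sample(all_combinations, k=n_iter)

n_fits_random = len(sampled) * cv_folds
print(n_fits_random, "fits instead of", len(all_combinations) * cv_folds)
```

The trade-off is that the best combination in the full grid may never be sampled, so random search buys speed at the cost of a less exhaustive search.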

In [None]:
from sklearn.model_selection import RandomizedSearchCV
import time

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5],
    'min_samples_split': [2, 4, 6],
    'min_samples_leaf': [1, 2, 3]
}

# Create the random forest regressor
rf_subset = RandomForestRegressor(random_state=0)

# Create the RandomizedSearchCV object
random_search = RandomizedSearchCV(estimator=rf_subset, param_distributions=param_grid, n_iter=10, cv=5, random_state=0)

# Start the timer
start_time_subset = time.time()

# Fit the random search (not the earlier grid search) to the training data
random_search.fit(X_train, y_train)

# Calculate the training time
training_time_subset = time.time() - start_time_subset

# Get the best parameters and best score
best_params_subset = random_search.best_params_
best_score_subset = random_search.best_score_

# Train the model with the best parameters
rf_best_subset_random = RandomForestRegressor(random_state=RNG_SEED, **best_params_subset)
rf_best_subset_random.fit(X_train, y_train)

# Predict on the test data
y_pred_subset = rf_best_subset_random.predict(X_test)

# Evaluate the model
r2_subset = r2_score(y_test, y_pred_subset)
mae_subset = mean_absolute_error(y_test, y_pred_subset)
rmse_subset = np.sqrt(mean_squared_error(y_test, y_pred_subset))

print("Best parameters (subset):", best_params_subset)
print("Best score (subset):", best_score_subset)
print("R2 score (subset):", r2_subset)
print("Mean absolute error (subset):", mae_subset)
print("Root mean squared error (subset):", rmse_subset)
print("Training time (subset):", training_time_subset, "seconds")


In [None]:
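#featurize with all the data, not just the subset
# (split first so the test rows are not also used for training)
train_df, test_df = train_test_split(df, test_size=0.33, random_state=RNG_SEED)
X_train, y_train, formulae_train, skipped_train = composition.generate_features(train_df.reset_index(drop=True), elem_prop='oliynyk', drop_duplicates=False, extend_features=True, sum_feat=True)
X_test, y_test, formulae_test, skipped_test = composition.generate_features(test_df.reset_index(drop=True), elem_prop='oliynyk', drop_duplicates=False, extend_features=True, sum_feat=True)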

# Start the timer
start_time = time.time()

# Train the model with the best parameters
rf_best_subset_random.fit(X_train, y_train)

# Calculate the training time
training_time = time.time() - start_time

# Predict on the test data
y_pred = rf_best_subset_random.predict(X_test)

# Evaluate the model
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print("R2 score:", r2)
print("Mean absolute error:", mae)
print("Root mean squared error:", rmse)
print("Training time:", training_time, "seconds")


# Tree visualization
Finally, we can visualize a single tree from the forest with Graphviz (via `export_graphviz` and `pydot`).

In [None]:
# Import tools needed for visualization
from sklearn.tree import export_graphviz
import pydot
# Pull out one tree from the forest
tree = rf.estimators_[5]
# Export the image to a dot file
feature_list = list(X_train.columns)
export_graphviz(tree, out_file = 'tree.dot', feature_names = feature_list, rounded = True, precision = 1)
# Use dot file to create a graph
(graph, ) = pydot.graph_from_dot_file('tree.dot')
# Write graph to a png file
graph.write_png('tree.png')

# XGBoost and others

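Random forests are just one kind of tree ensemble. Boosted ensembles such as AdaBoost and XGBoost build their trees one after another, each new tree correcting the errors of the ones before it, instead of averaging independently trained trees. XGBoost, in particular, has emerged as one of the favorite and best-performing classical models.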
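To make the boosting idea concrete, here is a toy sketch in plain Python: depth-1 trees ("stumps") are fit one after another to the residuals of the current prediction. This is a bare-bones gradient boosting loop for squared error on made-up 1D data, not XGBoost's actual algorithm, which adds regularization and second-order gradient information. All function names are hypothetical.

```python
def fit_stump(x, residuals):
    """Find the single split on x that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def stump_predict(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def boost(x, y, n_rounds, learning_rate):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    preds = [sum(y) / len(y)] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        preds = [pi + learning_rate * stump_predict(stump, xi)
                 for xi, pi in zip(x, preds)]
    return preds


def mse(y, preds):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)


# Toy nonlinear data (made up for illustration)
x_toy = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y_toy = [xi ** 2 for xi in x_toy]

mse_1 = mse(y_toy, boost(x_toy, y_toy, n_rounds=1, learning_rate=1.0))
mse_20 = mse(y_toy, boost(x_toy, y_toy, n_rounds=20, learning_rate=0.5))
print("training MSE, 1 stump:", mse_1)
print("training MSE, 20 boosted stumps:", mse_20)
```

The training error drops as stumps are added, which is also why boosted models need care (learning rate, number of rounds, a held-out set) to avoid overfitting.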

# Try it yourself!

- Find oxides with MPRester and load their formula and band gap energy
- Create a CBFV with this data for use in training models 
- Create a RF model using this data and score it 
- Create a single decision tree model (using sklearn DecisionTreeRegressor, since band gap is a continuous target) using this data and score it
- How do the scores of the two models compare? Why? 
- Extract the feature importance from the RF model and see which features matter the most

In [None]:
# Code below