# Challenge 7 - Fight Fire with Data
## Random Forest Model to Predict Fire Spread

In this challenge you will run Python code in a Jupyter Notebook. First, you will check whether wind speed and brightness are correlated with the fire's speed of spread, which was derived from satellite data. The input data has been prepared for you. Next, you will run the code that builds a random forest model, using the selected features (such as wind speed and brightness) as inputs and the speed of spread as the target variable. You will train the model, record its Mean Absolute Error (MAE), and save the model in a deployable format, Predictive Model Markup Language (PMML).
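
The correlation check described above could be sketched roughly as follows. This is only an illustration: the data is synthetic and made up for the example, and only the column names (`WindSpeedMph`, `brightness`, `speed_mph`) are taken from the challenge data. With the real data, the same `corr()` call would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the challenge data (assumption: all values are
# invented; only the column names match the notebook).
rng = np.random.default_rng(137)
n = 200
wind = rng.uniform(0, 30, n)
bright = rng.uniform(300, 400, n)
speed = 0.01 * wind + 0.001 * (bright - 300) + rng.normal(0, 0.02, n)
demo = pd.DataFrame({'WindSpeedMph': wind,
                     'brightness': bright,
                     'speed_mph': speed})

# Pearson correlation of each candidate feature with the target
corr_with_speed = demo.corr()['speed_mph'].drop('speed_mph')
print(corr_with_speed)
```

Values close to 1 or -1 suggest a strong linear relationship with the speed of spread, while values near 0 suggest little linear relationship.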

## Install and Load Packages

In [None]:
import pandas as pd
import numpy as np
# Using scikit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split
# Import the model we are using
from sklearn.ensemble import RandomForestRegressor
import types

## Get and View Data

In [None]:
df = pd.read_csv("Challenge_7_Merged_Data_single_fire.csv", low_memory=False)
df.head()

In [None]:
print(f'Dataframe shape: {df.shape}\n')
print(f'Columns: {df.columns}')
df.head()

In [None]:
features_short = [
       'WindSpeedMph', 
       'SurfaceWindGustsMph',
       'ZeroToTenLiquidSoilMoisturePercent',
       'TenToFortyLiquidSoilMoisturePercent',
       'FortyToOneHundredLiquidSoilMoisturePercent',
       'SurfaceTemperatureFahrenheit', 
       'SurfaceDewpointTemperatureFahrenheit',
       'SurfaceWetBulbTemperatureFahrenheit', 
       'RelativeHumidityPercent',
       'brightness', 
       'bright_t31', 
       'frp', 
       'speed_mph'] 

# preview our df
print('Display df')
display(df[features_short].head())

# look at statistics of df
print('Describe dataframe')
display(df[features_short].describe())

In [None]:
# choose features
input_features = features_short[:-1]

## These are the features that we will put in the model

In [None]:
input_features = [
    'WindSpeedMph',
    'SurfaceWindGustsMph',
    'ZeroToTenLiquidSoilMoisturePercent',
    'TenToFortyLiquidSoilMoisturePercent',
    'FortyToOneHundredLiquidSoilMoisturePercent',
    'SurfaceTemperatureFahrenheit',
    'SurfaceDewpointTemperatureFahrenheit',
    'SurfaceWetBulbTemperatureFahrenheit',
    'RelativeHumidityPercent',
    'brightness',
    'bright_t31',
    'frp'
]

In [None]:
y = np.array(df['speed_mph'])
X = np.array(df[input_features])
print(y.shape)
print(X.shape)

## Make a train/test split for the model

In [None]:
# Hold out 25% of the rows as a test set; random_state fixes the shuffle
train_features, test_features, train_labels, test_labels = train_test_split(
    X, y, test_size=0.25, random_state=137)

print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)

## Train and test a random forest model using K-fold cross-validation
Here we split the training data into three folds. In each round, two folds are used for training and the remaining fold is used for validation.
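
To make the fold rotation concrete, here is a small sketch on nine placeholder rows (an assumption for illustration, not the challenge data). It uses the same `KFold` settings as the cell below and prints which rows land in the training and validation portions in each round.

```python
import numpy as np
from sklearn.model_selection import KFold

toy_rows = np.arange(9)  # nine placeholder rows, not the challenge data
kf_demo = KFold(n_splits=3, shuffle=True, random_state=8)

fold_sizes = []
seen_validation = []
for fold, (train_idx, val_idx) in enumerate(kf_demo.split(toy_rows), start=1):
    print(f'Fold {fold}: train rows {train_idx}, validation rows {val_idx}')
    fold_sizes.append((len(train_idx), len(val_idx)))
    seen_validation.extend(val_idx.tolist())
```

Each row appears in the validation portion exactly once across the three rounds, so every training row contributes to one validation score.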

In [None]:
import time
from sklearn.model_selection import KFold

# Instantiate model with 1000 decision trees, each with a maximum depth of 12
rf = RandomForestRegressor(
    n_estimators = 1000,
    max_depth = 12,
    n_jobs= -1, 
    random_state = 137,
    verbose=1
    )

# Set up cross validation
kf = KFold(n_splits=3, shuffle=True, random_state=8)

# Track start time
start_time = time.time()
# Keep track of MAE for each fit
all_mae = []
for train_index, test_index in kf.split(train_features):
    X_train, X_test = train_features[train_index], train_features[test_index]
    y_train, y_test = train_labels[train_index], train_labels[test_index]
    
    rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    errors = np.abs(predictions - y_test)
    mae = np.mean(errors)
    all_mae.append(mae)
    
print("--- %s seconds ---" % (time.time() - start_time))

## Display accuracy of the model
Let's check how the model did on the validation folds drawn from the training data.

In [None]:
# This is our verification code (0.034933)
print(f'Average Random Forest Mean Absolute Error over three folds: {np.mean(all_mae)}')

MAE: is it the same as yours?  
Mean Absolute Error: 0.0034933 mph.
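
For reference, the Mean Absolute Error is the average of the absolute differences between predicted and actual values, which is what the loop above computes for each fold. A minimal worked example follows, using three made-up speeds (an assumption for illustration, not output from the challenge data).

```python
import numpy as np

# Made-up actual and predicted speeds in mph (illustrative values only)
y_true = np.array([0.10, 0.25, 0.40])
y_pred = np.array([0.12, 0.20, 0.43])

# Absolute error per row, then the mean over all rows
abs_errors = np.abs(y_pred - y_true)
mae_demo = np.mean(abs_errors)
print(f'MAE: {mae_demo:.4f} mph')
```

Because the errors are averaged in the target's own units, the MAE here reads directly as miles per hour.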

## Export Predictive Model Markup Language File

Although we haven't officially tested the model on the test data, let's save it.

https://collaborate.pega.com/discussion/creating-pmml-python-r-and-pega
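
Since the held-out test split has not been scored yet, one option is to compute the MAE on it in the same way as for the validation folds. The sketch below shows the idea on synthetic data with a much smaller forest. The data, the `_demo` names, and the forest settings are assumptions for illustration; with the notebook's variables, the same steps would use `rf`, `test_features`, and `test_labels`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data (invented for this example)
rng = np.random.default_rng(137)
X_demo = rng.uniform(0, 1, size=(200, 3))
y_demo = 0.5 * X_demo[:, 0] + rng.normal(0, 0.01, 200)

# Same split proportions as the notebook: 75% train, 25% test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=137)

# Small forest so the example runs quickly
rf_demo = RandomForestRegressor(n_estimators=50, max_depth=6, random_state=137)
rf_demo.fit(X_tr, y_tr)

# Score on rows the model never saw during training
test_mae = np.mean(np.abs(rf_demo.predict(X_te) - y_te))
print(f'Held-out test MAE: {test_mae:.4f}')
```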

In [None]:
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

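# Wrap the random forest in a PMMLPipeline so it can be exported
pipeline = PMMLPipeline([
        ('random_forest', rf)])

# Refit the forest on the full training split (not just two of the three folds)
pipeline.fit(train_features, train_labels)

# Save the fitted pipeline as a PMML file
sklearn2pmml(pipeline, "randomforest.pmml", with_repr = True)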

## Complete Challenge

In [None]:
# Verification 

import ww
ww = ww.WatsonWarriors()
 
ww.answer(0, np.mean(all_mae))

In [None]:
## Paste validation code below.
