# Challenge 7 - Fight Fire with Data
## Random Forest Model to Predict Fire Spread

In this challenge you will run Python code in a Jupyter Notebook. First, you will check whether wind speed and brightness are correlated with the fire's speed of spread, which was derived from satellite data. The input data has been prepared for you. Next, you will run the code that builds a random forest model, using the features you select (such as wind speed and brightness) as inputs and the speed of spread as the target variable. You will train the model, record the Mean Absolute Error (MAE), and save the model in a deployable format, Predictive Model Markup Language (PMML).

## Install and Load Packages

In [1]:
import pandas as pd
import numpy as np
# Using scikit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split
# Import the model we are using
from sklearn.ensemble import RandomForestRegressor
import types

## Get and View Data

In [3]:
df = pd.read_csv("Challenge_7_Merged_Data_single_fire.csv" , low_memory=False)
df.head()

Unnamed: 0,SiteId,latitude,longitude,DateHrGmt,DateHrLwt,WindSpeedMph,WindDirectionDegrees,SurfaceWindGustsMph,ZeroToTenLiquidSoilMoisturePercent,TenToFortyLiquidSoilMoisturePercent,...,bright_t31,frp,daynight,type,datetime_start,lat_start,long_start,distance,duration,speed_mph
0,2161142584,36.46616,-121.89671,7/23/2016 3:00,7/22/2016 20:00,6.4,318,36.8,14.0,24.4,...,309.3,77.2,N,3,7/22/2016 20:21,36.46616,-121.89671,0.0,0.0,0.0
1,2161142584,36.46486,-121.90179,7/23/2016 3:00,7/22/2016 20:00,6.4,318,36.8,14.0,24.4,...,306.9,77.2,N,3,7/22/2016 20:21,36.46616,-121.89671,0.296552,0.0,0.0
2,2161142584,36.46379,-121.89375,7/23/2016 3:00,7/22/2016 20:00,6.4,318,36.8,14.0,24.4,...,306.1,77.7,N,3,7/22/2016 20:21,36.46616,-121.89671,0.232352,0.0,0.0
3,2161142584,36.46245,-121.8989,7/23/2016 3:00,7/22/2016 20:00,6.4,318,36.8,14.0,24.4,...,342.5,77.7,N,3,7/22/2016 20:21,36.46616,-121.89671,0.284073,0.0,0.0
4,2161142584,36.46112,-121.90392,7/23/2016 3:00,7/22/2016 20:00,6.4,318,36.8,14.0,24.4,...,301.6,55.2,N,3,7/22/2016 20:21,36.46616,-121.89671,0.53142,0.0,0.0


In [4]:
print(f'Dataframe shape: {df.shape}\n')
print(f'Columns: {df.columns}')
df.head()

Dataframe shape: (13818, 35)

Columns: Index(['SiteId', 'latitude', 'longitude', 'DateHrGmt', 'DateHrLwt',
       'WindSpeedMph', 'WindDirectionDegrees', 'SurfaceWindGustsMph',
       'ZeroToTenLiquidSoilMoisturePercent',
       'TenToFortyLiquidSoilMoisturePercent',
       'FortyToOneHundredLiquidSoilMoisturePercent',
       'SurfaceTemperatureFahrenheit', 'SurfaceDewpointTemperatureFahrenheit',
       'SurfaceWetBulbTemperatureFahrenheit', 'RelativeHumidityPercent',
       'time_stamp', 'brightness', 'scan', 'track', 'acq_date', 'acq_time',
       'satellite', 'instrument', 'confidence', 'version', 'bright_t31', 'frp',
       'daynight', 'type', 'datetime_start', 'lat_start', 'long_start',
       'distance', 'duration', 'speed_mph'],
      dtype='object')


Unnamed: 0,SiteId,latitude,longitude,DateHrGmt,DateHrLwt,WindSpeedMph,WindDirectionDegrees,SurfaceWindGustsMph,ZeroToTenLiquidSoilMoisturePercent,TenToFortyLiquidSoilMoisturePercent,...,bright_t31,frp,daynight,type,datetime_start,lat_start,long_start,distance,duration,speed_mph
0,2161142584,36.46616,-121.89671,7/23/2016 3:00,7/22/2016 20:00,6.4,318,36.8,14.0,24.4,...,309.3,77.2,N,3,7/22/2016 20:21,36.46616,-121.89671,0.0,0.0,0.0
1,2161142584,36.46486,-121.90179,7/23/2016 3:00,7/22/2016 20:00,6.4,318,36.8,14.0,24.4,...,306.9,77.2,N,3,7/22/2016 20:21,36.46616,-121.89671,0.296552,0.0,0.0
2,2161142584,36.46379,-121.89375,7/23/2016 3:00,7/22/2016 20:00,6.4,318,36.8,14.0,24.4,...,306.1,77.7,N,3,7/22/2016 20:21,36.46616,-121.89671,0.232352,0.0,0.0
3,2161142584,36.46245,-121.8989,7/23/2016 3:00,7/22/2016 20:00,6.4,318,36.8,14.0,24.4,...,342.5,77.7,N,3,7/22/2016 20:21,36.46616,-121.89671,0.284073,0.0,0.0
4,2161142584,36.46112,-121.90392,7/23/2016 3:00,7/22/2016 20:00,6.4,318,36.8,14.0,24.4,...,301.6,55.2,N,3,7/22/2016 20:21,36.46616,-121.89671,0.53142,0.0,0.0


In [5]:
features_short = [
       'WindSpeedMph', 
       'SurfaceWindGustsMph',
       'ZeroToTenLiquidSoilMoisturePercent',
       'TenToFortyLiquidSoilMoisturePercent',
       'FortyToOneHundredLiquidSoilMoisturePercent',
       'SurfaceTemperatureFahrenheit', 
       'SurfaceDewpointTemperatureFahrenheit',
       'SurfaceWetBulbTemperatureFahrenheit', 
       'RelativeHumidityPercent',
       'brightness', 
       'bright_t31', 
       'frp', 
       'speed_mph'] 

# preview our df
print('Display df')
display(df[features_short].head())

# look at statistics of df
print('Describe dataframe')
display(df[features_short].describe())

Display df


Unnamed: 0,WindSpeedMph,SurfaceWindGustsMph,ZeroToTenLiquidSoilMoisturePercent,TenToFortyLiquidSoilMoisturePercent,FortyToOneHundredLiquidSoilMoisturePercent,SurfaceTemperatureFahrenheit,SurfaceDewpointTemperatureFahrenheit,SurfaceWetBulbTemperatureFahrenheit,RelativeHumidityPercent,brightness,bright_t31,frp,speed_mph
0,6.4,36.8,14.0,24.4,25.4,60.4,50.3,54.6,70,367.0,309.3,77.2,0.0
1,6.4,36.8,14.0,24.4,25.4,60.4,50.3,54.6,70,267.7,306.9,77.2,0.0
2,6.4,36.8,14.0,24.4,25.4,60.4,50.3,54.6,70,367.0,306.1,77.7,0.0
3,6.4,36.8,14.0,24.4,25.4,60.4,50.3,54.6,70,367.0,342.5,77.7,0.0
4,6.4,36.8,14.0,24.4,25.4,60.4,50.3,54.6,70,356.6,301.6,55.2,0.0


Describe dataframe


Unnamed: 0,WindSpeedMph,SurfaceWindGustsMph,ZeroToTenLiquidSoilMoisturePercent,TenToFortyLiquidSoilMoisturePercent,FortyToOneHundredLiquidSoilMoisturePercent,SurfaceTemperatureFahrenheit,SurfaceDewpointTemperatureFahrenheit,SurfaceWetBulbTemperatureFahrenheit,RelativeHumidityPercent,brightness,bright_t31,frp,speed_mph
count,13818.0,13818.0,13818.0,13818.0,13818.0,13818.0,13818.0,13818.0,13818.0,13818.0,13818.0,13818.0,13818.0
mean,4.545781,23.536286,13.847829,22.251563,23.355384,62.288197,51.163511,55.729831,70.660226,333.765769,299.103843,47.553807,0.049042
std,2.032529,7.382075,0.971136,1.828861,1.642913,7.603074,4.100671,3.157922,20.616516,23.644904,11.751736,162.295969,0.3526
min,0.1,3.6,12.3,17.9,19.1,39.2,16.8,38.1,11.0,208.0,260.2,0.2,0.0
25%,3.1,18.5,12.6,21.1,22.5,56.7,49.6,54.1,55.0,314.1,291.7,3.7,0.017262
50%,4.2,23.9,14.2,22.5,23.5,61.3,51.5,56.1,71.0,333.6,296.7,11.7,0.02727
75%,5.8,27.5,14.7,23.8,24.7,68.0,53.7,58.0,91.0,349.5,304.7,36.6,0.042211
max,17.9,55.1,15.6,24.7,25.6,91.6,60.0,64.3,100.0,502.1,400.1,5452.3,33.725228
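
The introduction mentions checking whether wind speed and brightness are correlated with the speed of spread. A minimal sketch of that check is shown here. It uses a small synthetic DataFrame (not the challenge data, and with a made-up relationship) so that it runs on its own; the names `demo` and `corr_with_target` are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the challenge data (NOT real fire observations).
# In the notebook, use the `df` loaded above instead of `demo`.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "WindSpeedMph": rng.uniform(0, 18, n),
    "brightness": rng.uniform(300, 400, n),
})
# Hypothetical relationship, chosen only so the example has some signal.
demo["speed_mph"] = 0.01 * demo["WindSpeedMph"] + rng.normal(0, 0.02, n)

# Pearson correlation of each candidate feature with the target column.
corr_with_target = demo.corr()["speed_mph"].drop("speed_mph")
print(corr_with_target.sort_values(ascending=False))
```

In the notebook itself, the same idea would presumably be `df[features_short].corr()['speed_mph']`. Pearson correlation only captures linear relationships, so a low value does not rule out a feature being useful to a random forest.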


In [6]:
# choose features
input_features = features_short[:-1]

## These are the features that we will put in the model

In [7]:
input_features = [
       'WindSpeedMph', 
#        'SurfaceWindGustsMph',
#        'ZeroToTenLiquidSoilMoisturePercent',
#        'TenToFortyLiquidSoilMoisturePercent',
#        'FortyToOneHundredLiquidSoilMoisturePercent',
#        'SurfaceTemperatureFahrenheit', 
#        'SurfaceDewpointTemperatureFahrenheit',
#        'SurfaceWetBulbTemperatureFahrenheit', 
       'RelativeHumidityPercent',
       'brightness', 
       'bright_t31', 
       'frp' 
]

In [8]:
y = np.array(df['speed_mph'])
X = np.array(df[input_features])
print(y.shape)
print(X.shape)

(13818,)
(13818, 5)


## Make a train/test split for the model

In [9]:
# Hold out 25% of the rows as a test set; random_state makes the split reproducible
train_features, test_features, train_labels, test_labels = train_test_split(X, y, test_size=0.25, random_state=137)

print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)

Training Features Shape: (10363, 5)
Training Labels Shape: (10363,)
Testing Features Shape: (3455, 5)
Testing Labels Shape: (3455,)


## Train and test a random forest model using K-Fold Cross-Validation
Here we split the training data into three folds. In each round, two folds are used for training and the remaining fold is used for validation.

In [10]:
import time
from sklearn.model_selection import KFold

# Instantiate model with 100 decision trees, each with a maximum depth of 2
rf = RandomForestRegressor(
    n_estimators = 100,
    max_depth = 2,
    n_jobs= -1, 
    random_state = 137,
    verbose=1
    )

# Set up cross validation
kf = KFold(n_splits=3, shuffle=True, random_state=8)

# Track start time
start_time = time.time()
# Keep track of MAE for each fit
all_mae = []
for train_index, test_index in kf.split(train_features):
    X_train, X_test = train_features[train_index], train_features[test_index]
    y_train, y_test = train_labels[train_index], train_labels[test_index]
    
    rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    errors = (abs(predictions - y_test))
    mae = np.mean(errors)
    all_mae.append(mae)
    
print("--- %s seconds ---" % (time.time() - start_time))

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    0.2s finished
[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    0.0s
[Parallel(n_jobs=2)]: Done 100 out of 100 | elapsed:    0.0s finished
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    0.2s finished
[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    0.0s
[Parallel(n_jobs=2)]: Done 100 out of 100 | elapsed:    0.0s finished
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:    0.1s


--- 1.3670573234558105 seconds ---


[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    0.2s finished
[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    0.0s
[Parallel(n_jobs=2)]: Done 100 out of 100 | elapsed:    0.0s finished
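
The manual loop above can also be expressed with scikit-learn's `cross_val_score`. The sketch below is one possible equivalent, run on synthetic stand-in data (not the challenge dataset) so it is self-contained; `X_demo`, `y_demo`, `demo_model`, and `kf_demo` are illustrative names, and the smaller `n_estimators` is only there to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (NOT the challenge dataset).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(300, 5))
y_demo = 0.05 * X_demo[:, 0] + rng.normal(0, 0.01, 300)

demo_model = RandomForestRegressor(n_estimators=20, max_depth=2, random_state=137)
kf_demo = KFold(n_splits=3, shuffle=True, random_state=8)

# scikit-learn reports *negative* MAE so that larger is always better;
# flip the sign to get the usual positive error.
neg_mae = cross_val_score(
    demo_model, X_demo, y_demo, cv=kf_demo, scoring="neg_mean_absolute_error"
)
fold_mae = -neg_mae
print("MAE per fold:", fold_mae)
print("Mean MAE:", fold_mae.mean())
```

`cross_val_score` fits a fresh clone of the estimator for each fold, so unlike the loop above it does not leave `demo_model` itself fitted afterwards.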


## Display accuracy of the model
Let's check how the model did on the validation folds held out from the training data.

In [12]:
(f'Average Random Forest Mean Absolute Error over three folds: {np.mean(all_mae)}') # THIS IS OUR VERIFICATION CODE(0.034933)

'Average Random Forest Mean Absolute Error over three folds: 0.034933263379460365'

MAE: is it the same as yours?  
Mean Absolute Error: 0.034933 mph.

## Export Predictive Model Markup Language File.

Although we haven't officially tested the model on the test data, let's save it.

https://collaborate.pega.com/discussion/creating-pmml-python-r-and-pega

In [None]:
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

# instantiate PMMLPipeline object
pipeline = PMMLPipeline([
        ('random_forest', rf)])

# train
pipeline.fit(train_features, train_labels)

# save
sklearn2pmml(pipeline, "randomforest.pmml", with_repr = True)
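
The held-out test split has not been scored in this notebook. A minimal sketch of how that could be done is shown here, using synthetic stand-in data so it runs on its own. In the notebook, one would presumably call `rf.predict(test_features)` and compare the result with `test_labels`. The names `X_demo`, `y_demo`, and `demo_model` are illustrative and not part of the challenge code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the challenge dataset).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(300, 5))
y_demo = 0.05 * X_demo[:, 0] + rng.normal(0, 0.01, 300)

# Same split proportions as the notebook: 75% train, 25% test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=137
)

demo_model = RandomForestRegressor(n_estimators=20, max_depth=2, random_state=137)
demo_model.fit(X_tr, y_tr)

# MAE on rows the model never saw during training.
test_mae = mean_absolute_error(y_te, demo_model.predict(X_te))
print(f"Held-out test MAE: {test_mae:.6f}")
```

Comparing this number with the cross-validated MAE gives a rough sense of whether the model generalizes beyond the folds it was tuned on.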

In [15]:



# Import the lib
from project_lib import Project
project = Project(sc,"<ProjectId>", "<ProjectToken>")

# let's assume you have the pandas DataFrame  pandas_df which contains the data
# you want to save in your object storage as a csv file
project.save_data("file_name.csv", pandas_df.to_csv(index=False))

# the function returns a dict which contains the asset_id, bucket_name and file_name
# upon successful saving of the data

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    0.3s finished


# Complete Challenge

In [None]:
# Verification

import ww
ww = ww.WatsonWarriors()
 
ww.answer(0, np.mean(all_mae))

### Enter code for completion below. 

In [1]:
## Paste validation code below.
