In [1]:

"""
Part of the AI implementation in this project was adapted from an open-source Fire Prediction System 
available on GitHub, created by developer vaniseth. This code was used to establish a preliminary 
model for fire prediction, and then it was refined to suit our Digital Twin system. 
The original repository can be accessed at: https://github.com/vaniseth/Forest-Fire-Prediction-System
"""


"""
Algerian Fires Dataset Analysis for Digital Twin Fire Prediction

This Python script is dedicated to the analysis of the Algerian Fires dataset,
which is a rare and valuable compilation of fire occurrence data in Algeria.
The dataset encompasses records from two distinct regions of Algeria and is
classified into 'fire' and 'not fire' instances, which makes it particularly
useful for developing predictive models in fire management systems.

Data set Availability:
The dataset is available at the UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/Algerian+Forest+Fires+Dataset++

Data Set Information:
- Contains 244 instances from two regions of Algeria, 122 per region.
- Labels: 'fire' (138 instances) and 'not fire' (106 instances).
- Timeframe of data: June 2012 to September 2012.

Attribute Information:
Weather data observations and FWI Components such as temperature, relative humidity,
wind speed, rain, and various indices from the FWI system like FFMC, DMC, DC, ISI, and BUI.

Purpose:
The aim of this script is to use machine learning techniques to build a model that will
be deployed as part of a digital twin for fire prediction. This model aims to reduce the
likelihood of fires by accurately predicting fire occurrences based on weather and FWI components.

Within the digital twin, the model provides early warning of likely fire outbreaks,
so that preventive measures can be taken in fire-prone areas.
"""


**FWI System Components**
5. FFMC: Fine Fuel Moisture Code (28.6 to 92.5)
6. DMC: Duff Moisture Code (1.1 to 65.9)
7. DC: Drought Code (7 to 220.4)
8. ISI: Initial Spread Index (0 to 18.5)
9. BUI: Buildup Index (1.1 to 68)
10. FWI: Fire Weather Index (0 to 31.1)

**Classification Outcome**

- Fire or not Fire

In [2]:
import warnings
warnings.filterwarnings("ignore")
# **Import Required Library**
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error,mean_absolute_percentage_error, r2_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
import bz2,pickle

In [3]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

In [4]:
# Importing the dataset and dropping irrelevant features for analysis
df1 = pd.read_csv('Algerian_forest_fire_cleaned-data.csv')
# Days, months, and years are excluded from the analysis to focus on environmental factors
df2 = df1.drop(['day','month','year'], axis=1)
df2.head(10)

Unnamed: 0,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes,Region
0,29,57,18,0.0,65.7,3.4,7.6,1.3,3.4,0.5,0,1
1,29,61,13,1.3,64.4,4.1,7.6,1.0,3.9,0.4,0,1
2,26,82,22,13.1,47.1,2.5,7.1,0.3,2.7,0.1,0,1
3,25,89,13,2.5,28.6,1.3,6.9,0.0,1.7,0.0,0,1
4,27,77,16,0.0,64.8,3.0,14.2,1.2,3.9,0.5,0,1
5,31,67,14,0.0,82.6,5.8,22.2,3.1,7.0,2.5,1,1
6,33,54,13,0.0,88.2,9.9,30.5,6.4,10.9,7.2,1,1
7,30,73,15,0.0,86.6,12.1,38.3,5.6,13.5,7.1,1,1
8,25,88,13,0.2,52.9,7.9,38.8,0.4,10.5,0.3,0,1
9,28,79,12,0.0,73.2,9.5,46.3,1.3,12.6,0.9,0,1


In [5]:
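# Preparing data for regression to predict the Fire Weather Index (FWI)
# Note: columns 0-9 still include FWI itself; the correlation filter below removes it before training
X = df2.iloc[:, 0:10]  # Input features
y = df2['FWI']  # Target variable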

In [6]:
# Splitting the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25,random_state=0)

In [7]:
# Identifying features with high correlation to eliminate redundancy
def correlation(dataset, threshold):
    col_corr = set()
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold: 
                colname = corr_matrix.columns[i]
                col_corr.add(colname)
    return col_corr

In [8]:
corr_features = correlation(X_train, 0.8)
corr_features

{'BUI', 'DC', 'FWI'}
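To see how this filter behaves, here is a minimal sketch on synthetic data. The helper name `correlated_columns` and the columns `a`, `b`, `c` are illustrative and not part of the project. Because the inner loop only visits pairs with `j < i`, the later column of each correlated pair is the one that gets flagged.

```python
import numpy as np
import pandas as pd

def correlated_columns(frame, threshold):
    """Return the later column of every pair whose |correlation| exceeds threshold."""
    corr = frame.corr().abs()
    flagged = set()
    for i in range(len(corr.columns)):
        for j in range(i):  # only pairs below the diagonal
            if corr.iloc[i, j] > threshold:
                flagged.add(corr.columns[i])
    return flagged

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=200),  # almost a copy of "a"
    "c": rng.normal(size=200),                      # independent noise
})

flagged = correlated_columns(toy, 0.8)
print(flagged)
```

Only `b` is flagged here: `a` comes first in column order, so it is kept as the representative of the pair.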

In [9]:
# Removing features with correlation above 0.8 threshold
X_train.drop(corr_features,axis=1, inplace=True)
X_test.drop(corr_features,axis=1, inplace=True)
X_train.shape, X_test.shape

((182, 7), (61, 7))

In [10]:
# Scaling features to standardize the dataset
def scaler_standard(X_train, X_test):
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    return X_train_scaled, X_test_scaled

In [11]:
X_train_scaled, X_test_scaled = scaler_standard(X_train, X_test)
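The sketch below, on synthetic arrays that are assumptions for illustration only, shows what `fit_transform` followed by `transform` does: the mean and standard deviation are learned from the training rows and then reused unchanged on the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=30.0, scale=5.0, size=(100, 2))  # stand-in training features
test = rng.normal(loc=30.0, scale=5.0, size=(20, 2))    # stand-in test features

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean_ and scale_ from train only
test_scaled = scaler.transform(test)        # applies the training statistics

# The same transformation written out by hand
manual = (test - scaler.mean_) / scaler.scale_
print(np.allclose(manual, test_scaled))
```

Note that `scaler_standard` returns only the scaled arrays, so the fitted scaler is not available afterwards. Returning it as well would be one way to apply the same scaling to new inputs at prediction time.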

In [12]:
# Training the Random Forest Regressor
Random_Forest_Regressor = RandomForestRegressor()
Random_Forest_Regressor.fit(X_train_scaled, y_train)

In [13]:
Random_Forest_Regressor_prediction = Random_Forest_Regressor.predict(X_test_scaled)
Random_Forest_Regressor_prediction

array([1.0408e+01, 7.4950e+00, 7.3860e+00, 4.6730e+00, 8.1080e+00,
       1.2431e+01, 2.3100e-01, 9.4950e+00, 6.0850e+00, 1.2801e+01,
       1.5800e+00, 1.4396e+01, 6.3120e+00, 1.5753e+01, 7.1200e-01,
       1.9300e-01, 1.2840e+00, 2.3180e+00, 5.3780e+00, 1.9600e-01,
       4.9130e+00, 5.8890e+00, 2.3210e+00, 2.2300e-01, 2.7860e+00,
       2.9300e+00, 1.0101e+01, 3.3200e-01, 1.5400e-01, 8.7700e-01,
       1.5449e+01, 4.2900e-01, 2.6300e-01, 2.2061e+01, 5.1480e+00,
       8.8300e-01, 9.3200e-01, 1.6326e+01, 2.7861e+01, 1.3610e+00,
       6.8270e+00, 8.6700e-01, 1.0700e-01, 2.0250e+00, 8.7900e-01,
       1.2000e-02, 3.4370e+00, 7.3270e+00, 1.4000e-02, 3.3750e+00,
       5.1570e+00, 1.0709e+01, 1.2000e-02, 2.4160e+00, 6.5820e+00,
       2.1010e+00, 1.2054e+01, 3.4320e+00, 3.4300e+00, 1.9086e+01,
       3.3920e+00])

In [14]:
Actual_predicted = pd.DataFrame({'Actual FWI': y_test, 'Predicted FWI': Random_Forest_Regressor_prediction})
#Actual_predicted

In [15]:
meanAbErr = metrics.mean_absolute_error(y_test, Random_Forest_Regressor_prediction)
meanSqErr = metrics.mean_squared_error(y_test, Random_Forest_Regressor_prediction)
rootMeanSqErr = np.sqrt(metrics.mean_squared_error(y_test, Random_Forest_Regressor_prediction))

In [16]:
# Output the error metrics
print('Mean Absolute Error:', meanAbErr)
print('Mean Square Error:', meanSqErr)
print('Root Mean Square Error:', rootMeanSqErr)

Mean Absolute Error: 0.5799508196721314
Mean Square Error: 0.7116612622950819
Root Mean Square Error: 0.8436001791696597
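For reference, the three error metrics can be written out directly in NumPy. The four-value arrays below are made-up numbers chosen so the arithmetic is easy to follow; they are not taken from the dataset.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])  # illustrative targets
y_pred = np.array([1.5, 2.0, 2.0, 5.0])  # illustrative predictions

errors = y_true - y_pred
mae = np.mean(np.abs(errors))  # mean absolute error
mse = np.mean(errors ** 2)     # mean squared error
rmse = np.sqrt(mse)            # root mean squared error

print(mae, mse, rmse)  # 0.625 0.5625 0.75
```

RMSE is the square root of MSE, which is why the notebook's 0.8436 squares to its reported MSE of 0.7117.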


In [17]:
# To find coefficient of determination
r2 =  r2_score(y_test, Random_Forest_Regressor_prediction)
print("R-Square:",r2)

R-Square: 0.9794995337466419


In [18]:
# Hyperparameter tuning for Random Forest Regressor
param_grid = [{
    'bootstrap': [True, False],
    'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120],
    # 1.0 (all features) replaces 'auto', which newer scikit-learn releases no longer accept
    'max_features': [1.0, 'sqrt'],
    'min_samples_leaf': [1, 3, 4],
    'min_samples_split': [2, 6, 10],
    'n_estimators': [5, 20, 50, 100]}]

In [19]:
Random_Forest_Regressor = RandomForestRegressor()
Random_rf = RandomizedSearchCV(Random_Forest_Regressor,param_grid, cv = 10, verbose=2,n_jobs = -1)
Random_rf.fit(X_train_scaled, y_train)

Fitting 10 folds for each of 10 candidates, totalling 100 fits
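The "10 candidates" in this log come from `RandomizedSearchCV`'s default `n_iter=10`: rather than trying every combination, it samples ten of them and fits each across the 10 folds, giving 100 fits. The sketch below counts the full grid and draws a comparable sample with `ParameterSampler`. The `random_state` is an assumption for reproducibility, and `1.0` stands in for the older `'auto'` option.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

search_space = {
    'bootstrap': [True, False],
    'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120],
    'max_features': [1.0, 'sqrt'],
    'min_samples_leaf': [1, 3, 4],
    'min_samples_split': [2, 6, 10],
    'n_estimators': [5, 20, 50, 100],
}

# Size of the exhaustive grid: 2 * 12 * 2 * 3 * 3 * 4
total_combinations = len(ParameterGrid(search_space))

# Ten randomly drawn settings, as a randomized search with n_iter=10 would evaluate
candidates = list(ParameterSampler(search_space, n_iter=10, random_state=0))

print(total_combinations, len(candidates))
```

With 1728 possible combinations, the random search covers well under 1% of the grid, so results can vary from run to run unless a `random_state` is fixed.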


In [20]:
best_random_grid=Random_rf.best_estimator_

In [21]:
bestrf_pred = best_random_grid.predict(X_test_scaled)
bestrf_pred

array([1.0745e+01, 6.9330e+00, 7.8730e+00, 4.4840e+00, 8.5300e+00,
       1.2751e+01, 2.6200e-01, 9.9490e+00, 5.2410e+00, 1.3157e+01,
       1.3920e+00, 1.3714e+01, 5.6050e+00, 1.8699e+01, 7.3300e-01,
       2.7800e-01, 1.4270e+00, 2.1080e+00, 4.6430e+00, 3.8200e-01,
       5.6650e+00, 6.7640e+00, 3.1260e+00, 2.7000e-01, 2.7520e+00,
       2.8900e+00, 1.0440e+01, 3.5400e-01, 1.8100e-01, 1.1610e+00,
       1.7619e+01, 4.1900e-01, 2.7900e-01, 2.1230e+01, 5.0670e+00,
       9.3800e-01, 9.1100e-01, 1.4268e+01, 2.7275e+01, 1.5930e+00,
       7.8620e+00, 8.8000e-01, 1.1600e-01, 2.3100e+00, 8.3800e-01,
       1.6000e-02, 3.5690e+00, 7.6040e+00, 3.3300e-01, 4.1590e+00,
       5.0920e+00, 1.1443e+01, 1.3000e-02, 2.1060e+00, 7.0200e+00,
       2.2480e+00, 1.2414e+01, 3.5770e+00, 4.0210e+00, 1.8558e+01,
       3.3090e+00])

In [22]:
Actual_predicted = pd.DataFrame({'Actual FWI': y_test, 'Predicted FWI': bestrf_pred})
#Actual_predicted

In [23]:
# Feature selection for deployment based on importance
feature_importances = Random_rf.best_estimator_.feature_importances_
importance_df = pd.DataFrame({
    'feature': X_train.columns,
    'importance': feature_importances
}).sort_values('importance', ascending=False)
importance_df

Unnamed: 0,feature,importance
6,ISI,0.363494
4,FFMC,0.244298
5,DMC,0.189525
3,Rain,0.087751
0,Temperature,0.050297
1,RH,0.049196
2,Ws,0.015438


In [24]:
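# Training the final regressor on a reduced feature set (Rain and RH dropped).
# Note: the importance ranking above places Ws lowest, so this choice is not dictated by the importances alone.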
X_train_new = X_train.drop(['Rain', 'RH'], axis=1)
X_test_new = X_test.drop(['Rain', 'RH'], axis=1)

In [25]:
X_train_new_scaled, X_test_new_scaled = scaler_standard(X_train_new, X_test_new)

In [26]:
best_random_grid.fit(X_train_new_scaled, y_train)
bestrf_pred = best_random_grid.predict(X_test_new_scaled)
bestrf_pred

array([1.0379e+01, 7.2700e+00, 8.1380e+00, 4.7540e+00, 8.5370e+00,
       1.2327e+01, 2.5000e-01, 8.9680e+00, 6.0580e+00, 1.3172e+01,
       1.3370e+00, 1.4522e+01, 6.1140e+00, 1.7187e+01, 6.0700e-01,
       2.1600e-01, 1.4800e+00, 2.0890e+00, 4.4710e+00, 2.6700e-01,
       6.9170e+00, 6.0280e+00, 3.3730e+00, 2.5200e-01, 2.7610e+00,
       2.9280e+00, 1.0916e+01, 3.2400e-01, 1.6000e-01, 9.9000e-01,
       1.7427e+01, 4.1300e-01, 2.7300e-01, 2.1667e+01, 4.7420e+00,
       8.2700e-01, 8.2700e-01, 1.5124e+01, 2.7523e+01, 1.4800e+00,
       7.2700e+00, 7.9500e-01, 9.1000e-02, 1.8680e+00, 8.7000e-01,
       2.3000e-02, 3.4940e+00, 7.2080e+00, 4.0000e-03, 5.1600e+00,
       4.7910e+00, 1.0839e+01, 2.5000e-02, 2.2570e+00, 6.7280e+00,
       2.1000e+00, 1.1494e+01, 3.5060e+00, 3.3760e+00, 1.8937e+01,
       3.2210e+00])

**Classification**

In [27]:
# Preparing the input features (X) and the target variable (y)
X = df2.iloc[:, 0:10]
y = df2['Classes']

In [28]:
# Splitting the dataset into a training set and a test set with a 70-30 split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=0)
X_train.shape, X_test.shape

((170, 10), (73, 10))
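With a dataset this small, a plain random split can shift the class balance between train and test. As a hedged sketch, passing `stratify=y` keeps the proportions aligned. The arrays below are stand-ins built from the label counts in the dataset description (138 fire, 106 not fire), not the actual file.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in labels: 138 'fire' (1) and 106 'not fire' (0)
y_demo = np.array([1] * 138 + [0] * 106)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# Share of 'fire' rows overall and in the stratified test set
print(round(y_demo.mean(), 3), round(y_te.mean(), 3))
```

Whether stratification changes the results reported in this notebook has not been checked; it is offered only as an option.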

In [29]:
# Identifying and removing features with high correlation to prevent multicollinearity
corr_features = correlation(X_train, 0.8)  # Correlation threshold set at 0.8
X_train.drop(corr_features, axis=1, inplace=True)  # Dropping correlated features from training set
X_test.drop(corr_features, axis=1, inplace=True)   # Dropping correlated features from test set

In [30]:
# Standardizing the features to have zero mean and unit variance
X_train_scaled, X_test_scaled = scaler_standard(X_train, X_test)

**Logistic Regression Model Training**

In [31]:
# Initializing the Logistic Regression model and training it on the scaled training data
Logistic_Regression  = LogisticRegression()
Logistic_Regression.fit(X_train_scaled,y_train)

In [32]:
print('Intercept is :',Logistic_Regression.intercept_)
print('Coefficient is :',Logistic_Regression.coef_)
print("Training Score:",Logistic_Regression.score(X_train_scaled, y_train))
print("Test Score:",Logistic_Regression.score(X_test_scaled,y_test))

Intercept is : [0.64908314]
Coefficient is : [[-0.00930744  0.24687675 -0.22764197 -0.53826457  2.53955465  0.9313307
   2.75456708]]
Training Score: 0.9705882352941176
Test Score: 0.9178082191780822


In [33]:
Logistic_Regression_Prediction = Logistic_Regression.predict(X_test_scaled)
Logistic_Regression_Prediction

array([1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1,
       0, 0, 1, 1, 1, 0, 1], dtype=int64)

In [34]:
# Comparing the actual class labels with the predicted ones
Actual_predicted = pd.DataFrame({'Actual Class': y_test, 'Predicted Class': Logistic_Regression_Prediction})
#Actual_predicted
Score = accuracy_score(y_test,Logistic_Regression_Prediction)
Classification_Report = classification_report(y_test,Logistic_Regression_Prediction)

In [35]:
print("Logistic Regression")
print ("Accuracy Score value: {:.4f}".format(Score))
print (Classification_Report)

Logistic Regression
Accuracy Score value: 0.9178
              precision    recall  f1-score   support

           0       0.85      0.97      0.91        30
           1       0.97      0.88      0.93        43

    accuracy                           0.92        73
   macro avg       0.91      0.93      0.92        73
weighted avg       0.92      0.92      0.92        73



Precision: Of the samples predicted as a given class, the fraction that actually belong to it.
Recall: Of the samples that actually belong to a given class, the fraction the model identifies.
F1 Score: Harmonic mean of precision and recall, balancing the two.
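As a worked example, the class-1 row of the Logistic Regression report can be reproduced from confusion-matrix counts. The counts below are inferred from the support and recall values in that report (43 fire rows, 30 not-fire rows); the notebook itself does not print them.

```python
# Counts inferred from the report above, treating class 1 (fire) as positive
tp = 38  # fire rows predicted as fire
fn = 5   # fire rows predicted as not fire
fp = 1   # not-fire rows predicted as fire
tn = 29  # not-fire rows predicted as not fire

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"{precision:.2f} {recall:.2f} {f1:.2f} {accuracy:.4f}")  # 0.97 0.88 0.93 0.9178
```

For fire prediction, recall on the fire class is arguably the more critical figure, since a missed fire (a false negative) is usually costlier than a false alarm.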

In [36]:
# **Decision Tree Classifier**
# Initialize and train the Decision Tree Classifier model
Decision_Tree_Classifier = DecisionTreeClassifier()
Decision_Tree_Classifier.fit(X_train_scaled,y_train)
Decision_Tree_Classifier_prediction = Decision_Tree_Classifier.predict(X_test_scaled)
Decision_Tree_Classifier_prediction

array([1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0,
       0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 0, 1, 1, 1, 0, 1], dtype=int64)

In [37]:
Actual_predicted = pd.DataFrame({'Actual Class': y_test, 'Predicted Class': Decision_Tree_Classifier_prediction})
#Actual_predicted
Score = accuracy_score(y_test,Decision_Tree_Classifier_prediction)
Classification_Report = classification_report(y_test,Decision_Tree_Classifier_prediction)
print("Decision Tree")
print ("Accuracy Score value: {:.4f}".format(Score))
print (Classification_Report)

Decision Tree
Accuracy Score value: 1.0000
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        30
           1       1.00      1.00      1.00        43

    accuracy                           1.00        73
   macro avg       1.00      1.00      1.00        73
weighted avg       1.00      1.00      1.00        73



In [38]:
# **Random Forest Classifier**
# Initialize and train the Random Forest Classifier model
Random_Forest_Classifier = RandomForestClassifier()
Random_Forest_Classifier.fit(X_train_scaled,y_train)

In [39]:
print("Training Score:",Random_Forest_Classifier.score(X_train_scaled, y_train))
print("Test Score:",Random_Forest_Classifier.score(X_test_scaled,y_test))

Training Score: 1.0
Test Score: 0.9726027397260274


In [40]:
Random_Forest_Classifier_prediction = Random_Forest_Classifier.predict(X_test_scaled)
Random_Forest_Classifier_prediction
Actual_predicted = pd.DataFrame({'Actual Class': y_test, 'Predicted Class': Random_Forest_Classifier_prediction})
#Actual_predicted

In [41]:
Score = accuracy_score(y_test,Random_Forest_Classifier_prediction)
Classification_Report = classification_report(y_test,Random_Forest_Classifier_prediction)
print("Random Forest")
print ("Accuracy Score value: {:.4f}".format(Score))
print (Classification_Report)

Random Forest
Accuracy Score value: 0.9726
              precision    recall  f1-score   support

           0       0.94      1.00      0.97        30
           1       1.00      0.95      0.98        43

    accuracy                           0.97        73
   macro avg       0.97      0.98      0.97        73
weighted avg       0.97      0.97      0.97        73



In [42]:
# **K-Nearest Neighbors Classifier**
# Initialize and train the K-Nearest Neighbors Classifier
K_Neighbors_Classifier = KNeighborsClassifier()
K_Neighbors_Classifier.fit(X_train_scaled,y_train)
print("Training Score:",K_Neighbors_Classifier.score(X_train_scaled, y_train))
print("Test Score:",K_Neighbors_Classifier.score(X_test_scaled,y_test))

Training Score: 0.9647058823529412
Test Score: 0.9315068493150684


In [43]:
K_Neighbors_Classifier_prediction = K_Neighbors_Classifier.predict(X_test_scaled)
K_Neighbors_Classifier_prediction
Actual_predicted = pd.DataFrame({'Actual Class': y_test, 'Predicted Class': K_Neighbors_Classifier_prediction})
#Actual_predicted
Score = accuracy_score(y_test,K_Neighbors_Classifier_prediction)
Classification_Report = classification_report(y_test,K_Neighbors_Classifier_prediction)
print("KNeighbors Classifier")
print ("Accuracy Score value: {:.4f}".format(Score))
print (Classification_Report)

KNeighbors Classifier
Accuracy Score value: 0.9315
              precision    recall  f1-score   support

           0       0.90      0.93      0.92        30
           1       0.95      0.93      0.94        43

    accuracy                           0.93        73
   macro avg       0.93      0.93      0.93        73
weighted avg       0.93      0.93      0.93        73



In [44]:
# **XGBoost Classifier**
# Initialize and train the XGBoost Classifier model
xgb = XGBClassifier()
xgb.fit(X_train_scaled,y_train)

In [45]:
print("Training Score:",xgb.score(X_train_scaled, y_train))
print("Test Score:",xgb.score(X_test_scaled,y_test))
xgb_predic = xgb.predict(X_test_scaled)
xgb_predic

Training Score: 0.9941176470588236
Test Score: 0.9726027397260274


array([1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0,
       0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1,
       0, 0, 1, 1, 1, 0, 1])

In [46]:
Actual_predicted = pd.DataFrame({'Actual Class': y_test, 'Predicted Class': xgb_predic})
#Actual_predicted
Score = accuracy_score(y_test, xgb_predic)
Classification_Report = classification_report(y_test, xgb_predic)
print("XGboost Classifier")
print ("Accuracy Score value: {:.4f}".format(Score))
print (Classification_Report)

XGboost Classifier
Accuracy Score value: 0.9726
              precision    recall  f1-score   support

           0       0.94      1.00      0.97        30
           1       1.00      0.95      0.98        43

    accuracy                           0.97        73
   macro avg       0.97      0.98      0.97        73
weighted avg       0.97      0.97      0.97        73



**Accuracy Score Results Summary**

In [47]:
# The summary table below lists the test-set accuracy of each classifier.
# These results help in choosing the best model.
"""
| Models                  | Accuracy score |
| ----------------------- | -------------- |
| Logistic Regression     | 91.78%         |
| Decision Tree           | 100.00%        |
| Random Forest           | 97.26%         |
| K-Neighbors Classifier  | 93.15%         |
| XGBoost Classifier      | 97.26%         |
"""


Hyperparameter Tuning and Model Selection

In [48]:
# **Hyperparameter Tuning for XGBoost Classifier**
params={
 "learning_rate"    : (np.linspace(0,10, 100)) ,
 "max_depth"        : (np.linspace(1,50, 25,dtype=int)),
 "min_child_weight" : [1, 3, 5, 7],
 "gamma"            : [0.0, 0.1, 0.2 , 0.3, 0.4],
 "colsample_bytree" : [0.3, 0.4, 0.5 , 0.7]}

In [49]:
# Initialize the RandomizedSearchCV object to perform hyperparameter tuning
Random_xgb = RandomizedSearchCV(xgb, params, cv = 10,n_jobs = -1)
Random_xgb.fit(X_train_scaled, y_train).best_estimator_

In [50]:
# Select the best estimator
Best_xgb = Random_xgb.best_estimator_
Best_xgb.score(X_test_scaled,y_test)

0.9041095890410958
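The tuned model scores below the default XGBoost classifier (0.9726). One possible contributor, offered as a hypothesis rather than a diagnosis, is the `learning_rate` range: XGBoost documents this parameter (alias `eta`) as lying in [0, 1], while the grid above samples from 0 to 10. The sketch below shows one way to draw learning rates from a narrower, log-uniform range. The bounds 0.01 and 1.0 are assumptions, and only the sampling step is shown, not a full search.

```python
from scipy.stats import loguniform
from sklearn.model_selection import ParameterSampler

# Hypothetical search space: only learning_rate differs from the grid above
params_sketch = {
    "learning_rate": loguniform(0.01, 1.0),  # continuous, log-uniform between the bounds
    "min_child_weight": [1, 3, 5, 7],
    "gamma": [0.0, 0.1, 0.2, 0.3, 0.4],
    "colsample_bytree": [0.3, 0.4, 0.5, 0.7],
}

samples = list(ParameterSampler(params_sketch, n_iter=10, random_state=0))
rates = [s["learning_rate"] for s in samples]
print([round(r, 3) for r in rates])
```

A dictionary like `params_sketch` can be passed to `RandomizedSearchCV` in place of `params`, since the search accepts both lists and distributions exposing an `rvs` method.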

In [51]:
# Predict and evaluate the best XGBoost model
Bestxgb_prediction = Best_xgb.predict(X_test_scaled)
Actual_predicted = pd.DataFrame({'Actual Class': y_test, 'Predicted Class': Bestxgb_prediction})
#Actual_predicted
xgb_score = accuracy_score(y_test, Bestxgb_prediction)
xgb_report = classification_report(y_test, Bestxgb_prediction)
print("Best XGBoost Classifier - Accuracy Score: {:.4f}".format(xgb_score))
print(xgb_report)

Best XGBoost Classifier - Accuracy Score: 0.9041
              precision    recall  f1-score   support

           0       0.81      1.00      0.90        30
           1       1.00      0.84      0.91        43

    accuracy                           0.90        73
   macro avg       0.91      0.92      0.90        73
weighted avg       0.92      0.90      0.90        73



In [52]:
# **Hyperparameter Tuning for Random Forest Classifier**
# Define a parameter grid to search for the best parameters for Random Forest
rf_params = {
    "n_estimators": [90, 100, 115, 130],
    "criterion": ['gini', 'entropy'],
    "max_depth": range(2, 20, 1),
    "min_samples_leaf": range(1, 10, 1),
    "min_samples_split": range(2, 10, 1),
    # 'sqrt' replaces 'auto' (equivalent for classifiers), which newer scikit-learn releases no longer accept
    "max_features": ['sqrt', 'log2']
}

In [53]:
# Initialize the RandomizedSearchCV object to perform hyperparameter tuning
Random_rf = RandomizedSearchCV(RandomForestClassifier(), rf_params, cv=10, n_jobs=-1)
Random_rf.fit(X_train_scaled, y_train)

In [54]:
# Select the best estimator
Best_rf = Random_rf.best_estimator_
Best_rf.score(X_test_scaled,y_test)

0.958904109589041

In [55]:
# Predict and evaluate the best Random Forest model
Bestrf_pred = Best_rf.predict(X_test_scaled)
Actual_predicted = pd.DataFrame({'Actual Class': y_test, 'Predicted Class': Bestrf_pred})
#Actual_predicted

In [56]:
Score = accuracy_score(y_test, Bestrf_pred)
Classification_Report = classification_report(y_test,Bestrf_pred)
print("FINAL Random Forest")
print ("Accuracy Score value: {:.4f}".format(Score))
print (Classification_Report)

FINAL Random Forest
Accuracy Score value: 0.9589
              precision    recall  f1-score   support

           0       0.91      1.00      0.95        30
           1       1.00      0.93      0.96        43

    accuracy                           0.96        73
   macro avg       0.95      0.97      0.96        73
weighted avg       0.96      0.96      0.96        73



In [57]:
# **Model Selection using Cross-Validation**
# Apply Stratified K-Fold cross-validation to check how consistently each model performs across folds.
# Stratified K-Fold keeps the class proportions roughly equal in every fold, which gives a more reliable estimate.
skfold = StratifiedKFold(n_splits= 10,shuffle= True,random_state= 0)

In [58]:
# Calculate cross-validation scores for all models
# Note: X and y are the full, unscaled 10-column feature matrix and labels from In [27]
cv_xgb_score = cross_val_score(Best_xgb, X, y, cv=skfold, scoring='accuracy').mean()
cv_rf_score = cross_val_score(Best_rf, X, y, cv=skfold, scoring='accuracy').mean()
cv_dt_score = cross_val_score(Decision_Tree_Classifier, X, y, cv=skfold, scoring='accuracy').mean()
cv_knn_score = cross_val_score(K_Neighbors_Classifier, X, y, cv=skfold, scoring='accuracy').mean()
cv_lg_score = cross_val_score(Logistic_Regression, X, y, cv=skfold, scoring='accuracy').mean()

In [59]:
# Print the mean cross-validation score for each model
print(f'Mean CV Accuracy - XGBoost: {cv_xgb_score:.4f}')
print(f'Mean CV Accuracy - Random Forest: {cv_rf_score:.4f}')
print(f'Mean CV Accuracy - Decision Tree: {cv_dt_score:.4f}')
print(f'Mean CV Accuracy - KNN: {cv_knn_score:.4f}')
print(f'Mean CV Accuracy - Logistic Regression: {cv_lg_score:.4f}')

Mean CV Accuracy - XGBoost: 0.9752
Mean CV Accuracy - Random Forest: 0.9793
Mean CV Accuracy - Decision Tree: 0.9712
Mean CV Accuracy - KNN: 0.9052
Mean CV Accuracy - Logistic Regression: 0.9630
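Distance-based models such as KNN are sensitive to feature scale, and the cross-validation calls above pass `X` without scaling. A minimal sketch of one alternative wraps the scaler and the model in a pipeline so that scaling is re-fitted inside each fold. The data here comes from `make_classification` and is purely illustrative, so the scores say nothing about the fire dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and labels
X_demo, y_demo = make_classification(n_samples=240, n_features=7, random_state=0)

skfold_demo = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# The scaler is fitted on the training part of every fold, never on the held-out part
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
scores = cross_val_score(knn_pipeline, X_demo, y_demo, cv=skfold_demo, scoring='accuracy')

print(f"Mean CV Accuracy - KNN pipeline: {scores.mean():.4f}")
```

The same pattern works for any of the classifiers above by swapping the last step of the pipeline.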


In [60]:
"""
| Models                | Training Score | Test Score | Accuracy | 
|-----------------------|----------------|------------|----------|
| Logistic Regression   | 0.9706         | 0.9178     | 91.78%   | 
| Decision Tree         | not reported   | 1.0000     | 100.00%  |
| Random Forest         | 1.0000         | 0.9726     | 97.26%   |
| K-Neighbors           | 0.9647         | 0.9315     | 93.15%   | 
| XGBoost               | 0.9941         | 0.9726     | 97.26%   | 
"""


Model Deployment

The XGBoost Classifier is among the strongest models here, with a test accuracy of 0.9726 at default settings and a mean cross-validation accuracy of 0.9752.
Hence, we will utilize this model for the final deployment phase.

Feature Importance for Deployment

In [61]:
X_train_new = X_train.drop(['Rain', 'RH'], axis=1)
X_test_new = X_test.drop(['Rain', 'RH'], axis=1)

In [62]:
# Re-scaling the reduced feature set (Rain and RH dropped) for the deployment model
X_train_new_scaled, X_test_new_scaled = scaler_standard(X_train_new, X_test_new)

In [63]:
xgb_model =Random_xgb.fit(X_train_new_scaled, y_train).best_estimator_
xgb_model.score(X_test_new_scaled, y_test)

0.9726027397260274

In [64]:
xgb_model_pred = xgb_model.predict(X_test_new_scaled)
xgb_model_pred

array([1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0,
       0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 0, 1, 1, 1, 0, 1])

In [65]:
Actual_predicted = pd.DataFrame({'Actual Class': y_test, 'Predicted Class': xgb_model_pred})
#Actual_predicted

In [66]:
Score = accuracy_score(y_test, xgb_model_pred)
Classification_Report = classification_report(y_test, xgb_model_pred)
print("Final Model XGB")
print ("Accuracy Score value: {:.4f}".format(Score))
print (Classification_Report)

Final Model XGB
Accuracy Score value: 0.9726
              precision    recall  f1-score   support

           0       0.97      0.97      0.97        30
           1       0.98      0.98      0.98        43

    accuracy                           0.97        73
   macro avg       0.97      0.97      0.97        73
weighted avg       0.97      0.97      0.97        73



In [67]:
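# Model Serialization for Deployment
# Compress and save the deployment model (the tuned XGBoost classifier) to a binary file using BZ2 compression.
# The original cell pickled best_random_grid, which is the Random Forest regressor, not the classifier.
import bz2, pickle
with bz2.BZ2File('Classification.pkl', 'wb') as file:
    pickle.dump(xgb_model, file)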
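On the consuming side, the file can be read back with the same two modules. The sketch below round-trips a small stand-in model through a temporary file. The file name `demo_model.pkl` and the toy data are assumptions for illustration; the real deployment code is not shown in this notebook.

```python
import bz2
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in model trained on toy data
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X_demo, y_demo)

path = os.path.join(tempfile.mkdtemp(), 'demo_model.pkl')

# Save with BZ2 compression
with bz2.BZ2File(path, 'wb') as f:
    pickle.dump(model, f)

# Load it back and confirm the predictions are unchanged
with bz2.BZ2File(path, 'rb') as f:
    restored = pickle.load(f)

same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same)
```

At prediction time, new inputs need the same feature columns and the same scaling that were used for training. Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.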