# Stability of the Grid System

Hi everyone, this is Qudus again. It's an honour once again to be part of the third stage of the Hamoye Data Science Internship. The goal of this work is to build a model that predicts whether a grid is stable or unstable, using the UCI Electrical Grid Stability Simulated dataset.

#### <u>Dataset Description</u>
The dataset has 12 primary predictive features and two dependent variables.

<u>Predictive features:</u>

- 'tau1' to 'tau4': the reaction time of each network participant, a real value within the range 0.5 to 10 ('tau1' corresponds to the supplier node, 'tau2' to 'tau4' to the consumer nodes);
- 'p1' to 'p4': nominal power produced (positive) or consumed (negative) by each network participant, a real value within the range -2.0 to -0.5 for consumers ('p2' to 'p4'). As the total power consumed equals the total power generated, p1 (supplier node) = - (p2 + p3 + p4);
- 'g1' to 'g4': price elasticity coefficient for each network participant, a real value within the range 0.05 to 1.00 ('g1' corresponds to the supplier node, 'g2' to 'g4' to the consumer nodes; 'g' stands for 'gamma');
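
The power balance stated for 'p1' can be checked by hand. Here is a minimal sketch using the first two rows of the dataset as printed by `data.head()` further down; the tolerance of 1e-5 is my own assumption, chosen to allow for rounding in the printed values:

```python
import math

# (p1, p2, p3, p4) copied from the first two rows of the dataset
rows = [
    (3.763085, -0.782604, -1.257395, -1.723086),
    (5.067812, -1.940058, -1.872742, -1.255012),
]

# Supplier output should cancel total consumption: p1 = -(p2 + p3 + p4)
balanced = [
    math.isclose(p1, -(p2 + p3 + p4), abs_tol=1e-5)
    for p1, p2, p3, p4 in rows
]
print(balanced)
```

Because 'p1' is fully determined by the other three power values, it carries no independent information beyond 'p2' to 'p4'.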

<u>Dependent variables:</u>

- 'stab': the maximum real part of the characteristic differential equation root (if positive, the system is linearly unstable; if negative, linearly stable);
- 'stabf': a categorical (binary) label ('stable' or 'unstable').

'stab' will be dropped because 'stabf' is derived directly from it. For this work, 'stabf' is therefore the dependent variable.
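
To make the link between the two dependent variables concrete, here is a small sketch that rebuilds the label from the sign of 'stab', using the 'stab' values of the first five rows. How a value of exactly 0 would be labelled is an assumption on my part:

```python
import numpy as np

# 'stab' values from the first five rows of the dataset
stab = np.array([0.055347, -0.005957, 0.003471, 0.028871, 0.04986])

# Positive maximum real part -> unstable, otherwise stable
# (the handling of exactly 0 is an assumption here)
stabf = np.where(stab > 0, "unstable", "stable")
print(stabf.tolist())
```

Keeping 'stab' among the features would leak the answer to the model, which is why it is dropped.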

To start with, I imported NumPy and pandas

In [1]:
import numpy as np
import pandas as pd

I loaded the data from the local directory using pandas

In [2]:
data = pd.read_csv('Data_for_UCI_named.csv')
data.head()

| | tau1 | tau2 | tau3 | tau4 | p1 | p2 | p3 | p4 | g1 | g2 | g3 | g4 | stab | stabf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.95906 | 3.079885 | 8.381025 | 9.780754 | 3.763085 | -0.782604 | -1.257395 | -1.723086 | 0.650456 | 0.859578 | 0.887445 | 0.958034 | 0.055347 | unstable |
| 1 | 9.304097 | 4.902524 | 3.047541 | 1.369357 | 5.067812 | -1.940058 | -1.872742 | -1.255012 | 0.413441 | 0.862414 | 0.562139 | 0.78176 | -0.005957 | stable |
| 2 | 8.971707 | 8.848428 | 3.046479 | 1.214518 | 3.405158 | -1.207456 | -1.27721 | -0.920492 | 0.163041 | 0.766689 | 0.839444 | 0.109853 | 0.003471 | unstable |
| 3 | 0.716415 | 7.6696 | 4.486641 | 2.340563 | 3.963791 | -1.027473 | -1.938944 | -0.997374 | 0.446209 | 0.976744 | 0.929381 | 0.362718 | 0.028871 | unstable |
| 4 | 3.134112 | 7.608772 | 4.943759 | 9.857573 | 3.525811 | -1.125531 | -1.845975 | -0.554305 | 0.79711 | 0.45545 | 0.656947 | 0.820923 | 0.04986 | unstable |


In [3]:
data.describe()

| | tau1 | tau2 | tau3 | tau4 | p1 | p2 | p3 | p4 | g1 | g2 | g3 | g4 | stab |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 5.25 | 5.250001 | 5.250004 | 5.249997 | 3.75 | -1.25 | -1.25 | -1.25 | 0.525 | 0.525 | 0.525 | 0.525 | 0.015731 |
| std | 2.742548 | 2.742549 | 2.742549 | 2.742556 | 0.75216 | 0.433035 | 0.433035 | 0.433035 | 0.274256 | 0.274255 | 0.274255 | 0.274255 | 0.036919 |
| min | 0.500793 | 0.500141 | 0.500788 | 0.500473 | 1.58259 | -1.999891 | -1.999945 | -1.999926 | 0.050009 | 0.050053 | 0.050054 | 0.050028 | -0.08076 |
| 25% | 2.874892 | 2.87514 | 2.875522 | 2.87495 | 3.2183 | -1.624901 | -1.625025 | -1.62496 | 0.287521 | 0.287552 | 0.287514 | 0.287494 | -0.015557 |
| 50% | 5.250004 | 5.249981 | 5.249979 | 5.249734 | 3.751025 | -1.249966 | -1.249974 | -1.250007 | 0.525009 | 0.525003 | 0.525015 | 0.525002 | 0.017142 |
| 75% | 7.62469 | 7.624893 | 7.624948 | 7.624838 | 4.28242 | -0.874977 | -0.875043 | -0.875065 | 0.762435 | 0.76249 | 0.76244 | 0.762433 | 0.044878 |
| max | 9.999469 | 9.999837 | 9.99945 | 9.999443 | 5.864418 | -0.500108 | -0.500072 | -0.500025 | 0.999937 | 0.999944 | 0.999982 | 0.99993 | 0.109403 |


I checked for missing values before beginning the analysis

In [4]:
data.isnull().sum()

tau1     0
tau2     0
tau3     0
tau4     0
p1       0
p2       0
p3       0
p4       0
g1       0
g2       0
g3       0
g4       0
stab     0
stabf    0
dtype: int64

Yipee! There are no missing values. I'm so happy

Ok, the next thing is to specify the features and the target variable

### Preprocessing

In [5]:
X = data.iloc[:,:-2]
y = data.iloc[:,13]
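
The positional slicing above relies on the column order. As a sketch of an equivalent selection by column name, here is a tiny stand-in frame with the same column layout as the dataset; the values and the names `demo`, `X_name` and `y_name` are made up for illustration:

```python
import pandas as pd

# Tiny stand-in frame with the same column order as the dataset
# (values are made up; only the column layout matters here)
cols = ["tau1", "tau2", "tau3", "tau4", "p1", "p2", "p3", "p4",
        "g1", "g2", "g3", "g4", "stab", "stabf"]
demo = pd.DataFrame(
    [[1.0] * 13 + ["stable"], [2.0] * 13 + ["unstable"]], columns=cols
)

# Positional selection, as in the cell above
X_pos, y_pos = demo.iloc[:, :-2], demo.iloc[:, 13]

# Equivalent selection by column name
X_name, y_name = demo.drop(columns=["stab", "stabf"]), demo["stabf"]

print(X_pos.equals(X_name), y_pos.equals(y_name))
```

Selecting by name keeps working if a column is later added or reordered, which positional indices do not guarantee.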

Moving to train-test split. 80% of the data will be used for training the model while the remaining 20% of the data will be used for evaluating the model. To do that, I imported the train_test_split from sklearn.

In [6]:
from sklearn.model_selection import train_test_split 
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1 )
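
To illustrate what the 80/20 split produces, here is a sketch on synthetic stand-in data with the same number of rows and features as the dataset. The random features and the class proportions are assumptions for illustration only:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 10,000 rows, 12 features, binary labels
# (class proportions are made up for this sketch)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10_000, 12))
y_demo = rng.choice(["stable", "unstable"], size=10_000, p=[0.36, 0.64])

x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)
print(x_tr.shape, x_te.shape)  # (8000, 12) (2000, 12)
```

The notebook does not stratify the split. Passing `stratify=y` to `train_test_split` is one option if you want the class proportions to be the same in both parts.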

The features are rescaled using the standard scaler, which is fitted on the training set only and then applied to the test set

In [7]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
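
As a sketch of what the scaler does, the snippet below reproduces its output by hand on synthetic data: each feature has the training-set mean subtracted and is divided by the training-set standard deviation. The array names and sizes are assumptions for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.uniform(0.5, 10.0, size=(100, 3))  # synthetic training features
test = rng.uniform(0.5, 10.0, size=(20, 3))    # synthetic test features

scaler = StandardScaler().fit(train)

# z = (x - mean) / std, with mean and std taken from the training set only
manual_test = (test - train.mean(axis=0)) / train.std(axis=0)

print(np.allclose(scaler.transform(test), manual_test))
```

Tree-based models are largely insensitive to feature scaling, so this step is unlikely to change the results below much, but it does no harm.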

### Fitting the Model

I am going to fit the model using four ensemble methods: two bagging-style methods (Random Forest and Extra Trees classifiers) and two boosting methods (Extreme Gradient Boosting, XGBoost, and Light Gradient Boosting, LightGBM). Then we will compare their accuracies.
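
For readers new to bagging, here is a hand-rolled sketch of the idea on synthetic data: fit several trees on bootstrap resamples and take a majority vote. This is only an illustration, not how scikit-learn implements its ensembles. Random forests also subsample features at each split, and Extra Trees in scikit-learn does not bootstrap by default.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (labels are 0/1)
X_demo, y_demo = make_classification(n_samples=500, n_features=12, random_state=1)
rng = np.random.default_rng(1)

# Bagging: fit each tree on a bootstrap resample of the data
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X_demo), size=len(X_demo))  # sample with replacement
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_demo[idx], y_demo[idx]))

# Majority vote over the 25 trees (odd count, so no ties)
votes = np.stack([t.predict(X_demo) for t in trees])  # shape (25, 500)
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)
train_acc = (bagged_pred == y_demo).mean()
print(train_acc)
```

Boosting differs in that the trees are built one after another, each one trying to correct the errors of the ensemble so far.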

##### <u>Random Forest Classifier</u>

In [8]:
# Fit the model
from sklearn.ensemble import RandomForestClassifier
rf_classifier = RandomForestClassifier(criterion='entropy', random_state=1)
rf_classifier.fit(x_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='entropy', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=1, verbose=0,
                       warm_start=False)

In [9]:
# Predicting Test set results
rf_pred = rf_classifier.predict(x_test)

##### Model Evaluation

The model will be evaluated using the cross-validation score and accuracy, and the confusion matrix will also be obtained

In [10]:
from sklearn.model_selection import cross_val_score 
rf_scores = cross_val_score(rf_classifier, x_train, y_train, cv=5, scoring='f1_macro') 
rf_scores.mean()

0.9104907148651128

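The mean macro-averaged F1 score from 5-fold cross-validation on the training set is about 0.91. Note that this is an F1 score rather than accuracy, because scoring='f1_macro' was used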
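
Since `scoring='f1_macro'` was passed, the number above is the unweighted mean of the per-class F1 scores. Here is a toy illustration with made-up labels (not taken from the dataset):

```python
from sklearn.metrics import f1_score

# Toy labels, made up for illustration
y_true = ["stable", "stable", "unstable", "unstable", "unstable", "unstable"]
y_hat = ["stable", "unstable", "unstable", "unstable", "unstable", "stable"]

# F1 for each class separately, then the macro average
per_class = f1_score(y_true, y_hat, average=None, labels=["stable", "unstable"])
macro = f1_score(y_true, y_hat, average="macro")

print(per_class, macro)
```

Because every class counts equally in the macro average, it is more sensitive to the minority class than plain accuracy is.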

In [11]:
from sklearn.metrics import accuracy_score, confusion_matrix 
confusion_matrix(y_test, rf_pred)

array([[ 622,   90],
       [  57, 1231]], dtype=int64)

From the above confusion matrix, we can deduce that the model classified 1853 samples correctly and 147 incorrectly, which amounts to an accuracy of 0.9265. From the confusion matrix, we can obtain the accuracy, recall score and F1 score. Rows are the actual classes and columns are the predicted classes, in the order 'stable', 'unstable'. Taking 'stable' as the positive class, the counts are
- True Positive = 622
- False Negative = 90
- False Positive = 57
- True Negative = 1231
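
As a worked example, the metrics can be computed from the random forest confusion matrix by hand. This sketch takes 'stable' as the positive class (my choice for illustration) and relies on scikit-learn's layout of rows as actual classes and columns as predicted classes:

```python
# Random forest confusion matrix from above:
# rows = actual, columns = predicted, label order ['stable', 'unstable']
cm = [[622, 90],
      [57, 1231]]

tp, fn = cm[0]  # actual stable: predicted stable / predicted unstable
fp, tn = cm[1]  # actual unstable: predicted stable / predicted unstable

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy)  # 0.9265
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

The same arithmetic applies to the confusion matrices of the other models.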

In [12]:
rf_accuracy = accuracy_score(y_test, rf_pred)
rf_accuracy

0.9265

##### <u>Extra Trees Classifier</u>

In [13]:
# fit the model
from sklearn.ensemble import ExtraTreesClassifier
ext_classifier = ExtraTreesClassifier(n_estimators=100, criterion='entropy', random_state=1)
ext_classifier.fit(x_train, y_train)

ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='entropy', max_depth=None, max_features='auto',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
                     n_jobs=None, oob_score=False, random_state=1, verbose=0,
                     warm_start=False)

In [14]:
# Predicting Test set results
ext_pred = ext_classifier.predict(x_test)

In [15]:
ext_accuracy = accuracy_score(y_test, ext_pred)
ext_accuracy

0.9285

In [16]:
ext_scores = cross_val_score(ext_classifier, x_train, y_train, cv=5, scoring='f1_macro') 
ext_scores.mean()

0.9140830151045914

The mean macro-averaged F1 score from 5-fold cross-validation is about 0.91 (scoring='f1_macro' measures F1, not accuracy)

In [17]:
confusion_matrix(y_test, ext_pred)

array([[ 598,  114],
       [  29, 1259]], dtype=int64)

From the above confusion matrix, we can deduce that the model classified 1857 samples correctly and 143 incorrectly, which amounts to an accuracy of 0.9285. From the confusion matrix, we can obtain the accuracy, recall score and F1 score. Rows are the actual classes and columns are the predicted classes, in the order 'stable', 'unstable'. Taking 'stable' as the positive class, the counts are
- True Positive = 598
- False Negative = 114
- False Positive = 29
- True Negative = 1259

##### <u>Extreme Gradient Boosting (XGBoost) Method</u>

In [18]:
# fit the model
from xgboost import XGBClassifier
xgb_classifier = XGBClassifier(n_estimators=100, random_state=1)
xgb_classifier.fit(x_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=1,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [19]:
# Predicting Test set results
xgb_pred = xgb_classifier.predict(x_test)

In [20]:
xgb_scores = cross_val_score(xgb_classifier, x_train, y_train, cv=5, scoring='f1_macro') 
xgb_scores.mean()

0.913159560444606

The mean macro-averaged F1 score from 5-fold cross-validation is about 0.91 (scoring='f1_macro' measures F1, not accuracy)

In [21]:
confusion_matrix(y_test, xgb_pred)

array([[ 603,  109],
       [  52, 1236]], dtype=int64)

From the above confusion matrix, we can deduce that the model classified 1839 samples correctly and 161 incorrectly, which amounts to an accuracy of 0.9195. From the confusion matrix, we can obtain the accuracy, recall score and F1 score. Rows are the actual classes and columns are the predicted classes, in the order 'stable', 'unstable'. Taking 'stable' as the positive class, the counts are
- True Positive = 603
- False Negative = 109
- False Positive = 52
- True Negative = 1236

In [22]:
xgb_accuracy = accuracy_score(y_test, xgb_pred)
xgb_accuracy

0.9195

##### <u>Light Gradient Boosting Method</u>

In [23]:
# fit the model
from lightgbm import LGBMClassifier
lgbm_classifier = LGBMClassifier(boosting_type='gbdt', n_estimators=100, random_state=1)
lgbm_classifier.fit(x_train, y_train)

LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.1, max_depth=-1,
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
               random_state=1, reg_alpha=0.0, reg_lambda=0.0, silent=True,
               subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

In [24]:
# Predicting Test set results
lgbm_pred = lgbm_classifier.predict(x_test)

In [25]:
lgbm_scores = cross_val_score(lgbm_classifier, x_train, y_train, cv=5, scoring='f1_macro') 
lgbm_scores.mean()

0.9330569051679415

The mean macro-averaged F1 score from 5-fold cross-validation is about 0.93 (scoring='f1_macro' measures F1, not accuracy)

In [26]:
confusion_matrix(y_test, lgbm_pred)

array([[ 635,   77],
       [  48, 1240]], dtype=int64)

From the above confusion matrix, we can deduce that the model classified 1875 samples correctly and 125 incorrectly, which amounts to an accuracy of 0.9375. From the confusion matrix, we can obtain the accuracy, recall score and F1 score. Rows are the actual classes and columns are the predicted classes, in the order 'stable', 'unstable'. Taking 'stable' as the positive class, the counts are
- True Positive = 635
- False Negative = 77
- False Positive = 48
- True Negative = 1240

In [27]:
lgbm_accuracy = accuracy_score(y_test, lgbm_pred)
lgbm_accuracy

0.9375

#### <u> Randomized Search CV </u>

I tried to improve the Extra Trees Classifier by searching for better hyperparameters using RandomizedSearchCV. Let us see if there will be an improvement

In [28]:
from sklearn.model_selection import RandomizedSearchCV
model = ExtraTreesClassifier(criterion = 'entropy', random_state=1)
n_estimators = [50, 100, 300, 500, 1000] 
min_samples_split = [2, 3, 5, 7, 9] 
min_samples_leaf = [1, 2, 4, 6, 8] 
max_features = ['auto', 'sqrt', 'log2', None]  # 'auto' was removed in scikit-learn 1.3; drop it on newer versions
hyperparameter_grid = {'n_estimators': n_estimators, 'min_samples_leaf': min_samples_leaf, 'min_samples_split': min_samples_split, 'max_features': max_features} 
rscv = RandomizedSearchCV(model, hyperparameter_grid)

In [29]:
best_model_random = rscv.fit(x_train, y_train)
best_model_random.best_estimator_

ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='entropy', max_depth=None, max_features='log2',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=3,
                     min_weight_fraction_leaf=0.0, n_estimators=300,
                     n_jobs=None, oob_score=False, random_state=1, verbose=0,
                     warm_start=False)

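OK, the above output is the best estimator given the parameter values I supplied to RandomizedSearchCV. Let us fit the optimized model to see if we get a better accuracy. Note that the cell below uses min_samples_split=5 and n_estimators=1000, which differ from the printed values of 3 and 300. Since the search was run without a fixed random_state, these presumably come from a different run of the search.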

In [30]:
opt_model = ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='entropy', max_depth=None, max_features='log2',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=5,
                     min_weight_fraction_leaf=0.0, n_estimators=1000,
                     n_jobs=None, oob_score=False, random_state=1, verbose=0,
                     warm_start=False)
opt_classifier = opt_model.fit(x_train, y_train)
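
An alternative to retyping the parameters by hand, as in the cell above, is to reuse the fitted search object directly. Below is a sketch on synthetic data with a deliberately smaller search space so it runs quickly. The data, the grid values, `n_iter=5` and `cv=3` are all assumptions for illustration and not the settings used in this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data so the sketch runs quickly
X_demo, y_demo = make_classification(n_samples=300, n_features=12, random_state=1)

param_distributions = {
    "n_estimators": [10, 20, 30],        # smaller than the notebook's grid, for speed
    "min_samples_split": [2, 3, 5],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2", None],
}

search = RandomizedSearchCV(
    ExtraTreesClassifier(criterion="entropy", random_state=1),
    param_distributions,
    n_iter=5,        # number of sampled combinations (the default is 10)
    cv=3,
    random_state=1,  # makes the sampled combinations reproducible
)
search.fit(X_demo, y_demo)

print(search.best_params_)
best = search.best_estimator_  # already refit on all of X_demo (refit=True by default)
```

Fixing `random_state` on the search means the printed best parameters and the model that is evaluated afterwards come from the same run.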

In [31]:
opt_pred = opt_classifier.predict(x_test)
opt_accuracy = accuracy_score(y_test, opt_pred)
opt_accuracy

0.936

The optimized Extra Trees model has an accuracy of 0.9360, which is an improvement over the previous Extra Trees Classifier's accuracy score of 0.9285

Lastly, let us view the accuracy of all the models in a table

In [32]:
compare = pd.DataFrame(['Random Forest Classifier', 'Extra Trees Classifier', 'Extra Boosting Model', 'Light Gradient Boosting Model', 'Optimized Extra Trees'], columns=['Models'])
compare['Accuracy'] = [rf_accuracy, ext_accuracy, xgb_accuracy, lgbm_accuracy, opt_accuracy]
compare

| | Models | Accuracy |
|---|---|---|
| 0 | Random Forest Classifier | 0.9265 |
| 1 | Extra Trees Classifier | 0.9285 |
| 2 | Extra Boosting Model | 0.9195 |
| 3 | Light Gradient Boosting Model | 0.9375 |
| 4 | Optimized Extra Trees | 0.936 |


The light gradient boosting model performed best on the test set for this dataset, with an accuracy score of 0.9375.
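
To make the ranking easier to read, the comparison table can be sorted by accuracy. Here is a small sketch that rebuilds the table from the accuracies reported above; the names `compare_demo` and `ranked` are just for this illustration:

```python
import pandas as pd

# Accuracies as reported in the comparison table
compare_demo = pd.DataFrame({
    "Models": ["Random Forest Classifier", "Extra Trees Classifier",
               "Extra Boosting Model", "Light Gradient Boosting Model",
               "Optimized Extra Trees"],
    "Accuracy": [0.9265, 0.9285, 0.9195, 0.9375, 0.936],
})

# Best model first
ranked = compare_demo.sort_values("Accuracy", ascending=False).reset_index(drop=True)
print(ranked)
```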

### Thank you for reading to the end!