The dataset has 12 primary predictive features and two dependent variables.

Predictive features:

'tau1' to 'tau4': the reaction time of each network participant, a real value within the range 0.5 to 10 ('tau1' corresponds to the supplier node, 'tau2' to 'tau4' to the consumer nodes);
'p1' to 'p4': nominal power produced (positive) or consumed (negative) by each network participant, a real value within the range -2.0 to -0.5 for consumers ('p2' to 'p4'). As the total power consumed equals the total power generated, p1 (supplier node) = - (p2 + p3 + p4);
'g1' to 'g4': price elasticity coefficient for each network participant, a real value within the range 0.05 to 1.00 ('g1' corresponds to the supplier node, 'g2' to 'g4' to the consumer nodes; 'g' stands for 'gamma').

Dependent variables:

'stab': the maximum real part of the characteristic differential equation root (if positive, the system is linearly unstable; if negative, linearly stable);
'stabf': a categorical (binary) label ('stable' or 'unstable').

Because 'stabf' is derived directly from 'stab' ('stabf' = 'stable' if 'stab' <= 0, 'unstable' otherwise), 'stab' should be dropped, leaving 'stabf' as the sole dependent variable (binary classification).
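
The two relationships described above (the power balance for 'p1' and the rule linking 'stab' to 'stabf') can be checked by hand. The following is a minimal sketch, assuming the values below, which are copied from the first two rows of the dataset preview; the helper names `label_from_stab` and `power_balanced` are hypothetical and not part of the dataset or any library.

```python
import math


def label_from_stab(stab: float) -> str:
    """Map a 'stab' value to the 'stabf' label: stable if stab <= 0."""
    return "stable" if stab <= 0 else "unstable"


def power_balanced(p1: float, p2: float, p3: float, p4: float, tol: float = 1e-6) -> bool:
    """Check p1 = -(p2 + p3 + p4) within a small absolute tolerance."""
    return math.isclose(p1, -(p2 + p3 + p4), abs_tol=tol)


# Values copied from the first two rows of the dataset preview.
rows = [
    {"p": (3.763085, -0.782604, -1.257395, -1.723086), "stab": 0.055347, "stabf": "unstable"},
    {"p": (5.067812, -1.940058, -1.872742, -1.255012), "stab": -0.005957, "stabf": "stable"},
]

for row in rows:
    assert power_balanced(*row["p"])
    assert label_from_stab(row["stab"]) == row["stabf"]
```

Both rows pass, which is consistent with 'stabf' carrying no information beyond the sign of 'stab'.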

Split the data 80-20 into training and test sets with random_state = 1. Fit the standard scaler on the training features (x_train) and use it to transform both the training features and the test features (x_test); the labels (y_train, y_test) are not scaled. Use scikit-learn to train a random forest classifier and an extra trees classifier, and use XGBoost and LightGBM to train an extreme gradient boosting model and a light gradient boosting model. Use random_state = 1 for training all models and evaluate them on the test set. Answer the following questions:

In [1]:
import pandas as pd

In [2]:
df= pd.read_csv('Data_for_UCI_named.csv')

In [3]:
df.head()

,tau1,tau2,tau3,tau4,p1,p2,p3,p4,g1,g2,g3,g4,stab,stabf
0,2.95906,3.079885,8.381025,9.780754,3.763085,-0.782604,-1.257395,-1.723086,0.650456,0.859578,0.887445,0.958034,0.055347,unstable
1,9.304097,4.902524,3.047541,1.369357,5.067812,-1.940058,-1.872742,-1.255012,0.413441,0.862414,0.562139,0.78176,-0.005957,stable
2,8.971707,8.848428,3.046479,1.214518,3.405158,-1.207456,-1.27721,-0.920492,0.163041,0.766689,0.839444,0.109853,0.003471,unstable
3,0.716415,7.6696,4.486641,2.340563,3.963791,-1.027473,-1.938944,-0.997374,0.446209,0.976744,0.929381,0.362718,0.028871,unstable
4,3.134112,7.608772,4.943759,9.857573,3.525811,-1.125531,-1.845975,-0.554305,0.79711,0.45545,0.656947,0.820923,0.04986,unstable


In [4]:
# Preprocessing
# 'stab' is dropped because 'stabf' is derived directly from it
df.drop('stab', axis=1, inplace=True)

X = df.drop(columns=['stabf'])  # features
y = df['stabf']                 # target

In [5]:
#split the data into training and testing sets 

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=1)

# Normalization

In [6]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training features only, then reuse it for the test features
normalised_train_df = scaler.fit_transform(x_train)
normalised_train_df = pd.DataFrame(normalised_train_df, columns=x_train.columns)

x_test = x_test.reset_index(drop=True)
normalised_test_df = scaler.transform(x_test)
normalised_test_df = pd.DataFrame(normalised_test_df, columns=x_test.columns)
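
To make the scaling step concrete: StandardScaler learns each column's mean and standard deviation from the training data only and reuses them for the test data. Below is a minimal pure-Python sketch of that idea for a single column, assuming the population standard deviation (which is what StandardScaler uses). The function names and the held-out value are hypothetical; the training values are the five 'tau1' entries from the dataset preview.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_values):
    """Learn the mean and population standard deviation from training values only."""
    return fmean(train_values), pstdev(train_values)


def standardize(values, mean, std):
    """Apply z = (x - mean) / std using statistics learned elsewhere."""
    return [(v - mean) / std for v in values]


train_tau1 = [2.95906, 9.304097, 8.971707, 0.716415, 3.134112]  # from the preview
test_tau1 = [5.0]  # hypothetical held-out value

mean, std = fit_standardizer(train_tau1)
scaled_train = standardize(train_tau1, mean, std)
scaled_test = standardize(test_tau1, mean, std)  # reuses the training mean and std
```

Reusing the training statistics for the test column is what keeps information from the test set out of the preprocessing step.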


# Training Model

In [7]:
from sklearn.ensemble import RandomForestClassifier

rf= RandomForestClassifier(random_state=1)
rf.fit(normalised_train_df,y_train)


RandomForestClassifier(random_state=1)

In [8]:
from sklearn.ensemble import ExtraTreesClassifier

etc= ExtraTreesClassifier(random_state=1)
etc.fit(normalised_train_df,y_train)




ExtraTreesClassifier(random_state=1)

In [9]:
from lightgbm import LGBMClassifier
lgb = LGBMClassifier(random_state=1)
lgb.fit(normalised_train_df, y_train)

LGBMClassifier(random_state=1)

In [10]:
from xgboost import XGBClassifier

xgb = XGBClassifier(random_state=1)
xgb.fit(normalised_train_df, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              gamma=0, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=4,
              num_parallel_tree=1, predictor='auto', random_state=1,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)
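
The cell above fits XGBClassifier directly on the string labels 'stable' and 'unstable'. That works with the XGBoost version that produced this output, but newer XGBoost releases may reject non-integer class labels. A minimal sketch of one workaround, assuming scikit-learn's `LabelEncoder` is acceptable for the task; the sample labels are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["unstable", "stable", "unstable"]  # illustrative 'stabf' values

encoder = LabelEncoder()
# Classes are sorted alphabetically, so 'stable' -> 0 and 'unstable' -> 1.
encoded = encoder.fit_transform(labels)
# inverse_transform maps integer predictions back to the original strings.
decoded = encoder.inverse_transform(encoded)
```

With this approach the encoder would be fitted on y_train, and the same fitted encoder applied to y_test before computing metrics.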

Find the feature importance using the optimal ExtraTreesClassifier model. Which features are the most and least important respectively?

In [11]:
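# Importances come from `etc`, the untuned ExtraTreesClassifier fitted in In [8];
# the tuned model from the randomized search is only built in In [12].
weights = pd.Series(etc.feature_importances_, index=normalised_train_df.columns).sort_values()
weights_df = pd.DataFrame(weights).reset_index()
weights_df.columns = ['Features', 'extra tree']
weights_df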

,Features,extra tree
0,p1,0.039507
1,p2,0.040371
2,p4,0.040579
3,p3,0.040706
4,g1,0.089783
5,g2,0.093676
6,g4,0.094019
7,g3,0.096883
8,tau3,0.113169
9,tau4,0.115466
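
The listing above shows only 10 of the 12 features; 'tau1' and 'tau2' do not appear. Because the series is sorted in ascending order, the least important feature is the first row. The following rough sketch shows what can still be inferred about the top of the ranking, assuming the importances sum to 1 (scikit-learn normalises them for its tree ensembles) and that the two missing rows are 'tau1' and 'tau2'.

```python
# Importances copied from the (truncated) output above.
shown = {
    "p1": 0.039507, "p2": 0.040371, "p4": 0.040579, "p3": 0.040706,
    "g1": 0.089783, "g2": 0.093676, "g4": 0.094019, "g3": 0.096883,
    "tau3": 0.113169, "tau4": 0.115466,
}

least_important = min(shown, key=shown.get)

# Importance left over for the two rows that are not listed (assumed 'tau1' and 'tau2').
missing_mass = 1.0 - sum(shown.values())
largest_shown = max(shown.values())

print(least_important)
print(round(missing_mass, 6))
print(missing_mass / 2 > largest_shown)
```

The least important feature is 'p1'. The two unlisted rows share the remaining importance, and their average exceeds the largest listed value, so the most important feature is one of the unlisted 'tau' features; which one cannot be determined from this output alone.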


Run a randomized search with the ExtraTreesClassifier as the estimator and cv=5, n_iter=10, scoring = 'accuracy', n_jobs = -1, verbose = 1 and random_state = 1. What are the best hyperparameters from the randomized search CV?

In [12]:
from sklearn.model_selection import RandomizedSearchCV

# Search space: 3 x 3 x 1 = 9 combinations, so n_iter=10 is capped at 9 candidates
random_grid = {
    'n_estimators': [100, 300, 500],
    'min_samples_split': [2, 5, 7],
    'min_samples_leaf': [8],
}
etr = RandomizedSearchCV(estimator=etc, param_distributions=random_grid, cv=5, n_iter=10,
                         scoring='accuracy', n_jobs=-1, verbose=1, random_state=1)
etr.fit(normalised_train_df, y_train)
et_best = etr.best_estimator_
print(et_best)



Fitting 5 folds for each of 9 candidates, totalling 45 fits
ExtraTreesClassifier(min_samples_leaf=8, n_estimators=500, random_state=1)
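
The log line "Fitting 5 folds for each of 9 candidates, totalling 45 fits" follows from the size of the search space rather than from n_iter. A small sketch of the arithmetic, using only the standard library; it enumerates the grid by hand and is not how scikit-learn samples internally.

```python
from itertools import product

random_grid = {
    "n_estimators": [100, 300, 500],
    "min_samples_split": [2, 5, 7],
    "min_samples_leaf": [8],
}

# Every distinct combination of the listed values.
combinations = list(product(*random_grid.values()))

n_iter, cv = 10, 5
# With only lists in the grid, the search cannot try more candidates than exist.
n_candidates = min(n_iter, len(combinations))
total_fits = n_candidates * cv

print(n_candidates, total_fits)  # prints: 9 45
```

In this case the randomized search is effectively an exhaustive grid search over all nine combinations.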


# Evaluating model

What is the accuracy on the test set using the XGBoost classifier, to 4 decimal places?



In [13]:
from sklearn.metrics import accuracy_score
xg_pred = xgb.predict(normalised_test_df)

# accuracy_score expects (y_true, y_pred)
accuracy = accuracy_score(y_test, xg_pred)
print('Accuracy: {}'.format(round(accuracy, 4)))

Accuracy: 0.9455
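
Accuracy is the fraction of test samples whose predicted label matches the true label. A minimal pure-Python sketch of the metric follows, with made-up labels; `simple_accuracy` is a hypothetical helper written for illustration, not the scikit-learn implementation.

```python
def simple_accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Made-up labels for illustration only.
y_true_demo = ["stable", "unstable", "unstable", "stable"]
y_pred_demo = ["stable", "unstable", "stable", "stable"]

demo_accuracy = simple_accuracy(y_true_demo, y_pred_demo)  # 3 of 4 labels match
```

Accuracy is symmetric in its two arguments, so swapping them does not change the value; metrics such as precision and recall are not symmetric, so the (y_true, y_pred) order matters there.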


What is the accuracy on the test set using the random forest classifier, to 4 decimal places?

In [14]:
rf_pred = rf.predict(normalised_test_df)

# accuracy_score expects (y_true, y_pred)
accuracy = accuracy_score(y_test, rf_pred)
print('Accuracy: {}'.format(round(accuracy, 4)))

Accuracy: 0.929


What is the accuracy on the test set using the LGBM classifier, to 4 decimal places?

In [15]:
lg_pred = lgb.predict(normalised_test_df)

# accuracy_score expects (y_true, y_pred)
accuracy = accuracy_score(y_test, lg_pred)
print('Accuracy: {}'.format(round(accuracy, 4)))

Accuracy: 0.9395


Train a new ExtraTreesClassifier model with the best hyperparameters from the RandomizedSearchCV (with random_state = 1). Is the accuracy of the tuned model higher or lower than that of the initial ExtraTreesClassifier model with no hyperparameter tuning?

In [17]:
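# etr.predict delegates to the best estimator, refitted on the full training set (refit=True by default)
etr_pred = etr.predict(normalised_test_df)
accuracy = accuracy_score(y_test, etr_pred)
print('Accuracy: {}'.format(round(accuracy, 4)))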

Accuracy: 0.9095
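
Answering "higher or lower" requires the test accuracy of the untuned model as well, which the cells above do not compute. The sketch below shows the shape of that comparison on synthetic stand-in data, since the CSV is not bundled here; the data, the reduced n_estimators and every variable name are assumptions for illustration, so its numbers say nothing about the grid dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 12 features, binary target.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=12, n_informative=6, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

# Untuned baseline versus a model using the min_samples_leaf value found by the search.
# n_estimators is kept small here only to make the sketch quick to run.
baseline = ExtraTreesClassifier(n_estimators=50, random_state=1).fit(X_tr, y_tr)
tuned = ExtraTreesClassifier(n_estimators=50, min_samples_leaf=8, random_state=1).fit(X_tr, y_tr)

baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
tuned_acc = accuracy_score(y_te, tuned.predict(X_te))

verdict = "higher" if tuned_acc > baseline_acc else "lower or equal"
print(f"baseline={baseline_acc:.4f} tuned={tuned_acc:.4f} -> tuned is {verdict}")
```

In the notebook itself, the equivalent step would be to score `etc.predict(normalised_test_df)` against `y_test` and compare the result with the 0.9095 reported above.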
