# Lab 6

Scikit learn provides a large variety of algorithms for some common Machine Learning tasks, such as:

* Classification
* Regression
* Clustering
* Feature Selection
* Anomaly Detection

It also provides some datasets that you can use to test these algorithms:

* Classification Datasets:
    * Breast cancer wisconsin
    * Iris plants (3-classes)
    * Optical recognition of handwritten digits (10-classes)
    * Wine (n-classes)

* Regression Datasets:
    * Boston house prices
    * Diabetes
    * Linnerrud (multiple regression)
    * California Housing

* Image:
    * The Olivetti faces
    * The Labeled Faces in the Wild face recognition
    * Forest covertypes

* NLP:
    * News group
    * Reuters Corpus Volume I

* Other:
    * Kddcup 99- Intrusion Detection

## Exercises

1. Use the full [Kddcup](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) dataset to compare classification performance of 3 different classifiers.
    * Separate the data into train, validation, and test.
    * Use accuracy as the metric for assessing performance.
    * For each classifier, identify the hyperparameters. Perform optimization over at least 2 hyperparameters.   
    * Compare the performance of the optimal configuration of the classifiers.

2. Pick the best algorithm in question 1. Create an ensemble of at least 25 models, and use them for the classification task. Identify the top and bottom 10% of the data in terms of uncertainty of the decision.

3. Use 2 different feature selection algorithm to identify the 10 most important features for the task in question 1. Retrain classifiers in question 1 with just this subset of features and compare performance.

4. Use the same data, removing the labels, and compare performance of 3 different clustering algorithms. Can you find clusters for each of the classes in question 1?

5. Can you identify any clusters within the top/botton 10% identified in 2. What are their characteristics?

6. Use the "SA" dataset to compare the performance of 3 different anomaly detection algorithms.

7. Create a subsample of 250 datapoints, redo question 6, using Leave-one-out as the method of evaluation.

8. Use the feature selection algorithm to identify the 5 most important features for the task in question 6, for each algorithm. Does the anomaly detection improve using less features?

## Quick look at the data

In [1]:
from sklearn.datasets import fetch_kddcup99
D=fetch_kddcup99()

In [2]:
dir(D)

['DESCR', 'data', 'feature_names', 'frame', 'target', 'target_names']

In [3]:
print(D["DESCR"])

.. _kddcup99_dataset:

Kddcup 99 dataset
-----------------

The KDD Cup '99 dataset was created by processing the tcpdump portions
of the 1998 DARPA Intrusion Detection System (IDS) Evaluation dataset,
created by MIT Lincoln Lab [2]_. The artificial data (described on the `dataset's
homepage <https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html>`_) was
generated using a closed network and hand-injected attacks to produce a
large number of different types of attack with normal activity in the
background. As the initial goal was to produce a large training set for
supervised learning algorithms, there is a large proportion (80.1%) of
abnormal data which is unrealistic in real world, and inappropriate for
unsupervised anomaly detection which aims at detecting 'abnormal' data, i.e.:

* qualitatively different from normal data
* in large minority among the observations.

We thus transform the KDD Data set into two different data sets: SA and SF.

* SA is obtained by simply selecting all

In [4]:
dir(D)

['DESCR', 'data', 'feature_names', 'frame', 'target', 'target_names']

In [5]:
import numpy as np
np.unique(D["target"])

array([b'back.', b'buffer_overflow.', b'ftp_write.', b'guess_passwd.',
       b'imap.', b'ipsweep.', b'land.', b'loadmodule.', b'multihop.',
       b'neptune.', b'nmap.', b'normal.', b'perl.', b'phf.', b'pod.',
       b'portsweep.', b'rootkit.', b'satan.', b'smurf.', b'spy.',
       b'teardrop.', b'warezclient.', b'warezmaster.'], dtype=object)

In [6]:
len(np.unique(D["target"]))

23

In [7]:
D["feature_names"]

['duration',
 'protocol_type',
 'service',
 'flag',
 'src_bytes',
 'dst_bytes',
 'land',
 'wrong_fragment',
 'urgent',
 'hot',
 'num_failed_logins',
 'logged_in',
 'num_compromised',
 'root_shell',
 'su_attempted',
 'num_root',
 'num_file_creations',
 'num_shells',
 'num_access_files',
 'num_outbound_cmds',
 'is_host_login',
 'is_guest_login',
 'count',
 'srv_count',
 'serror_rate',
 'srv_serror_rate',
 'rerror_rate',
 'srv_rerror_rate',
 'same_srv_rate',
 'diff_srv_rate',
 'srv_diff_host_rate',
 'dst_host_count',
 'dst_host_srv_count',
 'dst_host_same_srv_rate',
 'dst_host_diff_srv_rate',
 'dst_host_same_src_port_rate',
 'dst_host_srv_diff_host_rate',
 'dst_host_serror_rate',
 'dst_host_srv_serror_rate',
 'dst_host_rerror_rate',
 'dst_host_srv_rerror_rate']

# Exercise 1

In [8]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Dense, Input
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from tensorflow.keras.optimizers import Adam
import xgboost as xgb

In [9]:
#defining X
X=pd.DataFrame(data=D['data'],columns=D['feature_names'])

#encoding
en=LabelEncoder()
convert=X.select_dtypes(include=['object']).columns.tolist()

for col in convert:
    le = LabelEncoder()
    X[col] = en.fit_transform(X[col])

X


Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
0,0,1,22,9,164,4773,0,0,0,0,...,9,9,100,0,11,0,0,0,0,0
1,0,1,22,9,222,465,0,0,0,0,...,19,19,100,0,5,0,0,0,0,0
2,0,1,22,9,218,1316,0,0,0,0,...,29,29,100,0,3,0,0,0,0,0
3,0,1,22,9,202,1316,0,0,0,0,...,39,39,100,0,3,0,0,0,0,0
4,0,1,22,9,200,2006,0,0,0,0,...,49,49,100,0,2,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
494016,0,1,22,9,293,1856,0,0,0,0,...,86,255,100,0,1,5,0,1,0,0
494017,0,1,22,9,265,2254,0,0,0,0,...,6,255,100,0,17,5,0,1,0,0
494018,0,1,22,9,186,1179,0,0,0,0,...,16,255,100,0,6,5,6,1,0,0
494019,0,1,22,9,274,1179,0,0,0,0,...,26,255,100,0,4,5,4,1,0,0


In [10]:
#standardizing
scaler = StandardScaler()
X=scaler.fit_transform(X)
X=pd.DataFrame(data=X,columns=D['feature_names'])
X

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
0,-0.090604,0.925753,-0.104067,0.514274,-0.856640,3.281686,-0.006673,-0.048834,-0.002571,-0.049029,...,-3.451536,-1.694315,0.599396,-0.282867,-1.022077,-0.175636,-0.464438,-0.463419,-0.25204,-0.249464
1,-0.090604,0.925753,-0.104067,0.514274,-0.741563,0.079652,-0.006673,-0.048834,-0.002571,-0.049029,...,-3.297085,-1.600011,0.599396,-0.282867,-1.146737,-0.175636,-0.464438,-0.463419,-0.25204,-0.249464
2,-0.090604,0.925753,-0.104067,0.514274,-0.749499,0.712180,-0.006673,-0.048834,-0.002571,-0.049029,...,-3.142633,-1.505707,0.599396,-0.282867,-1.188291,-0.175636,-0.464438,-0.463419,-0.25204,-0.249464
3,-0.090604,0.925753,-0.104067,0.514274,-0.781245,0.712180,-0.006673,-0.048834,-0.002571,-0.049029,...,-2.988182,-1.411403,0.599396,-0.282867,-1.188291,-0.175636,-0.464438,-0.463419,-0.25204,-0.249464
4,-0.090604,0.925753,-0.104067,0.514274,-0.785213,1.225040,-0.006673,-0.048834,-0.002571,-0.049029,...,-2.833731,-1.317100,0.599396,-0.282867,-1.209067,-0.175636,-0.464438,-0.463419,-0.25204,-0.249464
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
494016,-0.090604,0.925753,-0.104067,0.514274,-0.600693,1.113549,-0.006673,-0.048834,-0.002571,-0.049029,...,-2.262261,0.625558,0.599396,-0.282867,-1.229844,1.196724,-0.464438,-0.426441,-0.25204,-0.249464
494017,-0.090604,0.925753,-0.104067,0.514274,-0.656248,1.409373,-0.006673,-0.048834,-0.002571,-0.049029,...,-3.497871,0.625558,0.599396,-0.282867,-0.897417,1.196724,-0.464438,-0.426441,-0.25204,-0.249464
494018,-0.090604,0.925753,-0.104067,0.514274,-0.812990,0.610351,-0.006673,-0.048834,-0.002571,-0.049029,...,-3.343420,0.625558,0.599396,-0.282867,-1.125961,1.196724,-0.305196,-0.426441,-0.25204,-0.249464
494019,-0.090604,0.925753,-0.104067,0.514274,-0.638391,0.610351,-0.006673,-0.048834,-0.002571,-0.049029,...,-3.188969,0.625558,0.599396,-0.282867,-1.167514,1.196724,-0.358277,-0.426441,-0.25204,-0.249464


In [11]:
#defining y
y=D['target']

#encoding
encoder=LabelEncoder()
y=encoder.fit_transform(y)
y

array([11, 11, 11, ..., 11, 11, 11])

In [12]:
np.unique(y)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22])

In [13]:
#checking dataset
df=pd.DataFrame(data=X,columns=D['feature_names'])
df['target']=D['target']
df.describe(include='all')

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,target
count,494021.0,494021.0,494021.0,494021.0,494021.0,494021.0,494021.0,494021.0,494021.0,494021.0,...,494021.0,494021.0,494021.0,494021.0,494021.0,494021.0,494021.0,494021.0,494021.0,494021
unique,,,,,,,,,,,...,,,,,,,,,,23
top,,,,,,,,,,,...,,,,,,,,,,b'smurf.'
freq,,,,,,,,,,,...,,,,,,,,,,280790
mean,5.1663180000000005e-17,-1.647699e-16,-3.256276e-17,-3.5899580000000003e-17,-3.7740580000000004e-17,-1.445188e-16,-7.493462e-18,2.445084e-18,2.277883e-18,-4.1523270000000004e-17,...,-1.348536e-16,-2.124059e-16,3.6474890000000005e-17,-2.982427e-16,-5.983263000000001e-17,-3.9581590000000004e-17,5.799163000000001e-17,2.8995820000000004e-17,-5.822176e-17,
std,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,...,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,
min,-0.09060359,-0.8115496,-1.729084,-3.484214,-1.18203,-0.2659719,-0.006673418,-0.04883422,-0.002571468,-0.04902904,...,-1.779188,-1.834994,-0.2828667,-1.250621,-0.1756363,-0.4644375,-0.4634186,-0.2520395,-0.249464,
25%,-0.09060359,-0.8115496,-0.6949824,0.5142739,-1.104651,-0.2659719,-0.006673418,-0.04883422,-0.002571468,-0.04902904,...,-1.345391,-0.8368938,-0.2828667,-1.250621,-0.1756363,-0.4644375,-0.4634186,-0.2520395,-0.249464,
50%,-0.09060359,-0.8115496,-0.6949824,0.5142739,-0.2078445,-0.2659719,-0.006673418,-0.04883422,-0.002571468,-0.04902904,...,0.6255576,0.5993962,-0.2828667,0.8270476,-0.1756363,-0.4644375,-0.4634186,-0.2520395,-0.249464,
75%,-0.09060359,0.9257531,1.373221,0.5142739,0.7782453,-0.2659719,-0.006673418,-0.04883422,-0.002571468,-0.04902904,...,0.6255576,0.5993962,0.08323588,0.8270476,-0.1756363,-0.4644375,-0.4634186,-0.2520395,-0.249464,


In [14]:
#splitting the data
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=1) #train = 70%
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=1) #test = 15%, validation = 15%
print("X_train.shape", X_train.shape, "y_train.shape", y_train.shape)
print("X_test.shape", X_test.shape, "y_test.shape", y_test.shape)
print("X_val.shape", X_test.shape, "y_val.shape", y_test.shape)

X_train.shape (345814, 41) y_train.shape (345814,)
X_test.shape (74104, 41) y_test.shape (74104,)
X_val.shape (74104, 41) y_val.shape (74104,)


In [None]:
#function to create a NN model
def create_nn(reg=tf.keras.regularizers.l2(0.1), learning_rate=0.001):
    model=keras.Sequential(
        [
            Input(shape=(X_train.shape[1],)),
            Dense(100, activation='relu',kernel_regularizer=reg),
            Dense(50, activation='relu',kernel_regularizer=reg),
            Dense(len(np.unique(y_train)), activation='softmax')
        ]
    )
    model.compile(optimizer=Adam(learning_rate=learning_rate), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model

#hyperparameter tuning for NN
nn_results=[]
for reg in [tf.keras.regularizers.l1(0.1),tf.keras.regularizers.l2(0.1)]: #hyperparameter 1: regularization
    for learning_rate in [0.001,0.01]: #hyperparameter 2: learning rate
        model = create_nn(reg, learning_rate)
        model.fit(X_train, y_train, epochs=10, verbose=0)
        loss, accuracy = model.evaluate(X_val, y_val, verbose=0)
        nn_results.append((reg, learning_rate, accuracy))

#finding best model (highest accuracy)
best_nn=max(nn_results, key=lambda x: x[2])
best_nn

(<keras.src.regularizers.regularizers.L2 at 0x30da24dd0>,
 0.001,
 0.9930097460746765)

In [None]:
#svm
svm=SVC()
svm.fit(X_train, y_train)
param_grid={'C': [0.1, 1, 10, 100],
              'gamma': [1, 0.1, 0.01, 0.001],
              'kernel': ['linear']}

grid=GridSearchCV(SVC(), param_grid, refit=True, scoring='accuracy', verbose=3)
grid.fit(X_train, y_train)
svm_best_params=grid.best_params_
print("Best Hyperparameters:", svm_best_params)

Fitting 5 folds for each of 16 candidates, totalling 80 fits




[CV 1/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.999 total time=  11.0s
[CV 2/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.999 total time=  10.1s
[CV 3/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.999 total time=  11.1s
[CV 4/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.999 total time=  10.8s
[CV 5/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.999 total time=  10.8s
[CV 1/5] END ...C=0.1, gamma=0.1, kernel=linear;, score=0.999 total time=  10.9s
[CV 2/5] END ...C=0.1, gamma=0.1, kernel=linear;, score=0.999 total time=  10.7s
[CV 3/5] END ...C=0.1, gamma=0.1, kernel=linear;, score=0.999 total time=  11.1s
[CV 4/5] END ...C=0.1, gamma=0.1, kernel=linear;, score=0.999 total time=  10.8s
[CV 5/5] END ...C=0.1, gamma=0.1, kernel=linear;, score=0.999 total time=  11.0s
[CV 1/5] END ..C=0.1, gamma=0.01, kernel=linear;, score=0.999 total time=  10.9s
[CV 2/5] END ..C=0.1, gamma=0.01, kernel=linear;, score=0.999 total time=  10.3s
[CV 3/5] END ..C=0.1, gamma=

In [None]:
#svm metrics
#Obtaining the best performing model from the grid search results
svm_best_model=grid.best_estimator_

#prediction
y_pred_svm=svm_best_model.predict(X_test)

#metrics
accuracy_svm=accuracy_score(y_test, y_pred_svm)
accuracy_svm

0.9994737126201014

In [None]:
#XGBoost
xgb_model=xgb.XGBClassifier()
xgb_model.fit(X_train, y_train)
param_grid={
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.1, 0.3]
}
grid_search=GridSearchCV(estimator=xgb_model, param_grid=param_grid, cv=3, scoring='accuracy', verbose=2)
grid_search.fit(X_train, y_train)
best_params=grid_search.best_params_
print("Best Hyperparameters:", best_params)


Fitting 3 folds for each of 27 candidates, totalling 81 fits




[CV] END ...learning_rate=0.01, max_depth=3, n_estimators=50; total time=   4.6s
[CV] END ...learning_rate=0.01, max_depth=3, n_estimators=50; total time=   4.6s
[CV] END ...learning_rate=0.01, max_depth=3, n_estimators=50; total time=   4.8s
[CV] END ..learning_rate=0.01, max_depth=3, n_estimators=100; total time=   9.7s
[CV] END ..learning_rate=0.01, max_depth=3, n_estimators=100; total time=  10.3s
[CV] END ..learning_rate=0.01, max_depth=3, n_estimators=100; total time=  10.1s
[CV] END ..learning_rate=0.01, max_depth=3, n_estimators=200; total time=  19.8s
[CV] END ..learning_rate=0.01, max_depth=3, n_estimators=200; total time=  19.6s
[CV] END ..learning_rate=0.01, max_depth=3, n_estimators=200; total time=  19.7s
[CV] END ...learning_rate=0.01, max_depth=4, n_estimators=50; total time=   4.7s
[CV] END ...learning_rate=0.01, max_depth=4, n_estimators=50; total time=   4.8s
[CV] END ...learning_rate=0.01, max_depth=4, n_estimators=50; total time=   4.8s
[CV] END ..learning_rate=0.0

In [None]:
#XGBoost metrics
#Obtaining the best performing model from the grid search results
xgb_best_model=grid_search.best_estimator_

#prediction
y_pred_xgb=xgb_best_model.predict(X_test)

#metrics
accuracy_xgb=accuracy_score(y_test, y_pred_xgb)
accuracy_xgb

0.9996356471985318

In [None]:
#printing best hyperparameters
print(f'Neural Network:\nRegularization: {best_nn[0]} \nLearning Rate: {best_nn[1]}\n')
print(f'SVM: {svm_best_params}\n')
print(f'XGB: {best_params}')


#ex1 results
ex1=pd.DataFrame({'':'Accuracy',
                  'NN':[best_nn[2]],
                  'SVM':[accuracy_svm],
                  'XGB':[accuracy_xgb]})
ex1

Neural Network:
Regularization: <keras.src.regularizers.regularizers.L2 object at 0x30da24dd0> 
Learning Rate: 0.001

SVM: {'C': 1, 'gamma': 1, 'kernel': 'linear'}

XGB: {'learning_rate': 0.3, 'max_depth': 5, 'n_estimators': 200}


Unnamed: 0,Unnamed: 1,NN,SVM,XGB
0,Accuracy,0.99301,0.999474,0.999636


# Exercise 2

In [None]:
from sklearn.ensemble import BaggingClassifier

#ensemble using XGBoost (best classifier from ex1)
ensemble_model=BaggingClassifier(xgb_best_model, n_estimators=25)
ensemble_model.fit(X_train, y_train)

#predictions and uncertainty
predictions=ensemble_model.predict(X_test)
probs=ensemble_model.predict_proba(X_test)
uncertainty=np.max(probs, axis=1)

#top/bottom percent of uncertainty (thresholds)
top10=np.percentile(uncertainty, 90)
bottom10=np.percentile(uncertainty, 10)

top10_percent=X_test[uncertainty>=top10]
bottom10_percent=X_test[uncertainty<=bottom10]

print(f'Top 10%: \n{top10_percent}')
print(f'Bottom 10%: \n{bottom10_percent}')

Top 10%: 
        duration  protocol_type   service      flag  src_bytes  dst_bytes  \
46291  -0.090604      -0.811550 -0.694982  0.514274   0.778245  -0.265972   
269681 -0.090604      -0.811550 -0.694982  0.514274   0.778245  -0.265972   
96348  -0.090604      -0.811550 -0.694982  0.514274   0.778245  -0.265972   
12345  -0.090604       0.925753 -0.104067  0.514274  -0.791165   0.695084   
239489 -0.090604      -0.811550 -0.694982  0.514274   0.778245  -0.265972   
...          ...            ...       ...       ...        ...        ...   
204623 -0.090604      -0.811550 -0.694982  0.514274   0.778245  -0.265972   
266965 -0.090604      -0.811550 -0.694982  0.514274   0.778245  -0.265972   
92318  -0.090604      -0.811550 -0.694982  0.514274   0.778245  -0.265972   
257998 -0.090604      -0.811550 -0.694982  0.514274   0.778245  -0.265972   
159059 -0.090604      -0.811550 -0.694982  0.514274   0.778245  -0.265972   

            land  wrong_fragment    urgent       hot  ...  dst_ho

# Exercise 3

In [None]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

#rfe
rf=RandomForestClassifier()
rfeselector=RFE(rf, n_features_to_select=10) #10 most important features
rfeselector=rfeselector.fit(X_train, y_train)
X_train=pd.DataFrame(X_train)
rfeselect=X_train.columns[rfeselector.support_]

#feature importance
rf.fit(X_train, y_train)
importances=rf.feature_importances_
indices=np.argsort(importances)[-10:] #10 most important features
importanceselect=X_train.columns[indices]

#redefining X_train to only include the selected features
X_rfe=X_train[rfeselect]
X_importance=X_train[importanceselect]


In [None]:
#redefining X_test and X_val
X_test=pd.DataFrame(X_test)
X_val=pd.DataFrame(X_val)

rfe_Xtest_sel=X_test.columns[rfeselector.support_]
rfe_Xval_sel=X_val.columns[rfeselector.support_]

fi_Xtest_sel=X_test.columns[indices]
fi_Xval_sel=X_val.columns[indices]

rfe_Xtest=X_test[rfe_Xtest_sel]
rfe_Xval=X_val[rfe_Xval_sel]

fi_Xtest=X_test[fi_Xtest_sel]
fi_Xval=X_val[fi_Xval_sel]

#### RFE Selections: Retraining classifiers from Exercise 1

In [None]:
#hyperparameter tuning for NN
#function to create a NN model
def create_nn_rfe(reg=tf.keras.regularizers.l2(0.1), learning_rate=0.001):
    model=keras.Sequential(
        [
            Input(shape=(X_rfe.shape[1],)),
            Dense(100, activation='relu',kernel_regularizer=reg),
            Dense(50, activation='relu',kernel_regularizer=reg),
            Dense(len(np.unique(y_train)), activation='softmax')
        ]
    )
    model.compile(optimizer=Adam(learning_rate=learning_rate), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model

rfe_nn_results=[]
for reg in [tf.keras.regularizers.l1(0.1),tf.keras.regularizers.l2(0.1)]: #hyperparameter 1: regularization
    for learning_rate in [0.001,0.01]: #hyperparameter 2: learning rate
        model = create_nn_rfe(reg, learning_rate)
        model.fit(X_rfe, y_train, epochs=10, verbose=0)
        loss, accuracy = model.evaluate(rfe_Xval, y_val, verbose=0)
        rfe_nn_results.append((reg, learning_rate, accuracy))

#finding best model (highest accuracy)
rfe_best_nn=max(rfe_nn_results, key=lambda x: x[2])
rfe_best_nn

(<keras.src.regularizers.regularizers.L2 at 0x35175c080>,
 0.001,
 0.9864377975463867)

In [None]:
#svm
svm=SVC()
svm.fit(X_rfe, y_train)
param_grid={'C': [0.1, 1, 10, 100],
              'gamma': [1, 0.1, 0.01, 0.001],
              'kernel': ['linear']}

rfe_grid=GridSearchCV(SVC(), param_grid, refit=True, scoring='accuracy', verbose=3)
rfe_grid.fit(X_rfe, y_train)
rfe_svm_best_params=rfe_grid.best_params_
print("Best Hyperparameters:", rfe_svm_best_params)

Fitting 5 folds for each of 16 candidates, totalling 80 fits




[CV 1/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.996 total time=  10.5s
[CV 2/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.996 total time=  10.4s
[CV 3/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.996 total time=  10.5s
[CV 4/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.996 total time=  10.7s
[CV 5/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.996 total time=  10.4s
[CV 1/5] END ...C=0.1, gamma=0.1, kernel=linear;, score=0.996 total time=  10.4s
[CV 2/5] END ...C=0.1, gamma=0.1, kernel=linear;, score=0.996 total time=  10.4s
[CV 3/5] END ...C=0.1, gamma=0.1, kernel=linear;, score=0.996 total time=  10.5s
[CV 4/5] END ...C=0.1, gamma=0.1, kernel=linear;, score=0.996 total time=  10.6s
[CV 5/5] END ...C=0.1, gamma=0.1, kernel=linear;, score=0.996 total time=  10.5s
[CV 1/5] END ..C=0.1, gamma=0.01, kernel=linear;, score=0.996 total time=  10.4s
[CV 2/5] END ..C=0.1, gamma=0.01, kernel=linear;, score=0.996 total time=  10.4s
[CV 3/5] END ..C=0.1, gamma=

In [None]:
#svm metrics
#Obtaining the best performing model from the grid search results
rfe_svm_best_model=rfe_grid.best_estimator_

#prediction
rfe_y_pred_svm=rfe_svm_best_model.predict(rfe_Xtest)

#metrics
rfe_accuracy_svm=accuracy_score(y_test, rfe_y_pred_svm)
rfe_accuracy_svm

0.9969772212026341

In [None]:
#XGBoost
xgb_model=xgb.XGBClassifier()
xgb_model.fit(X_rfe, y_train)
param_grid={
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.1, 0.3]
}
rfe_grid_search=GridSearchCV(estimator=xgb_model, param_grid=param_grid, cv=3, scoring='accuracy', verbose=2)
rfe_grid_search.fit(X_rfe, y_train)
rfe_best_params=rfe_grid_search.best_params_
print("Best Hyperparameters:", rfe_best_params)

Fitting 3 folds for each of 27 candidates, totalling 81 fits




[CV] END ...learning_rate=0.01, max_depth=3, n_estimators=50; total time=   2.8s
[CV] END ...learning_rate=0.01, max_depth=3, n_estimators=50; total time=   2.9s
[CV] END ...learning_rate=0.01, max_depth=3, n_estimators=50; total time=   3.0s
[CV] END ..learning_rate=0.01, max_depth=3, n_estimators=100; total time=   5.6s
[CV] END ..learning_rate=0.01, max_depth=3, n_estimators=100; total time=   5.8s
[CV] END ..learning_rate=0.01, max_depth=3, n_estimators=100; total time=   5.6s
[CV] END ..learning_rate=0.01, max_depth=3, n_estimators=200; total time=  11.0s
[CV] END ..learning_rate=0.01, max_depth=3, n_estimators=200; total time=  11.0s
[CV] END ..learning_rate=0.01, max_depth=3, n_estimators=200; total time=  10.9s
[CV] END ...learning_rate=0.01, max_depth=4, n_estimators=50; total time=   2.8s
[CV] END ...learning_rate=0.01, max_depth=4, n_estimators=50; total time=   2.9s
[CV] END ...learning_rate=0.01, max_depth=4, n_estimators=50; total time=   2.8s
[CV] END ..learning_rate=0.0

In [None]:
#XGBoost metrics
#Obtaining the best performing model from the grid search results
rfe_xgb_best_model=rfe_grid_search.best_estimator_

#prediction
rfe_y_pred_xgb=rfe_xgb_best_model.predict(rfe_Xtest)

#metrics
rfe_accuracy_xgb=accuracy_score(y_test, rfe_y_pred_xgb)
rfe_accuracy_xgb

0.9992982834934686

#### Feature Importance Selections: Retraining classifiers from Exercise 1

In [None]:
#hyperparameter tuning for NN
#function to create a NN model
def create_nn_fi(reg=tf.keras.regularizers.l2(0.1), learning_rate=0.001):
    model=keras.Sequential(
        [
            Input(shape=(X_importance.shape[1],)),
            Dense(100, activation='relu',kernel_regularizer=reg),
            Dense(50, activation='relu',kernel_regularizer=reg),
            Dense(len(np.unique(y_train)), activation='softmax')
        ]
    )
    model.compile(optimizer=Adam(learning_rate=learning_rate), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model

fi_nn_results=[]
for reg in [tf.keras.regularizers.l1(0.1),tf.keras.regularizers.l2(0.1)]: #hyperparameter 1: regularization
    for learning_rate in [0.001,0.01]: #hyperparameter 2: learning rate
        model = create_nn_fi(reg, learning_rate)
        model.fit(X_importance, y_train, epochs=10, verbose=0)
        loss, accuracy = model.evaluate(fi_Xval, y_val, verbose=0)
        fi_nn_results.append((reg, learning_rate, accuracy))

#finding best model (highest accuracy)
fi_best_nn=max(fi_nn_results, key=lambda x: x[2])
fi_best_nn

(<keras.src.regularizers.regularizers.L2 at 0x3521e0ce0>,
 0.001,
 0.9835768938064575)

In [None]:
#svm
svm=SVC()
svm.fit(X_importance, y_train)
param_grid={'C': [0.1, 1, 10, 100],
              'gamma': [1, 0.1, 0.01, 0.001],
              'kernel': ['linear']}

fi_grid=GridSearchCV(SVC(), param_grid, refit=True, scoring='accuracy', verbose=3)
fi_grid.fit(X_importance, y_train)
fi_svm_best_params=fi_grid.best_params_
print("Best Hyperparameters:", fi_svm_best_params)

Fitting 5 folds for each of 16 candidates, totalling 80 fits




[CV 1/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.996 total time=  10.4s
[CV 2/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.996 total time=  10.4s
[CV 3/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.996 total time=  10.5s
[CV 4/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.996 total time=  10.6s
[CV 5/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.996 total time=  10.5s
[CV 1/5] END ...C=0.1, gamma=0.1, kernel=linear;, score=0.996 total time=  10.4s
[CV 2/5] END ...C=0.1, gamma=0.1, kernel=linear;, score=0.996 total time=  10.5s
[CV 3/5] END ...C=0.1, gamma=0.1, kernel=linear;, score=0.996 total time=  10.6s
[CV 4/5] END ...C=0.1, gamma=0.1, kernel=linear;, score=0.996 total time=  10.5s
[CV 5/5] END ...C=0.1, gamma=0.1, kernel=linear;, score=0.996 total time=  10.5s
[CV 1/5] END ..C=0.1, gamma=0.01, kernel=linear;, score=0.996 total time=  10.4s
[CV 2/5] END ..C=0.1, gamma=0.01, kernel=linear;, score=0.996 total time=  10.4s
[CV 3/5] END ..C=0.1, gamma=

In [None]:
#svm metrics
#Obtaining the best performing model from the grid search results
fi_svm_best_model=fi_grid.best_estimator_

#prediction
fi_y_pred_svm=fi_svm_best_model.predict(fi_Xtest)

#metrics
fi_accuracy_svm=accuracy_score(y_test, fi_y_pred_svm)
fi_accuracy_svm

0.9969772212026341

In [None]:
#XGBoost
xgb_model=xgb.XGBClassifier()
xgb_model.fit(X_importance, y_train)
param_grid={
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.1, 0.3]
}
fi_grid_search=GridSearchCV(estimator=xgb_model, param_grid=param_grid, cv=3, scoring='accuracy', verbose=2)
fi_grid_search.fit(X_importance, y_train)
fi_best_params=fi_grid_search.best_params_
print("Best Hyperparameters:", fi_best_params)

Fitting 3 folds for each of 27 candidates, totalling 81 fits




[CV] END ...learning_rate=0.01, max_depth=3, n_estimators=50; total time=   2.8s
[CV] END ...learning_rate=0.01, max_depth=3, n_estimators=50; total time=   2.9s
[CV] END ...learning_rate=0.01, max_depth=3, n_estimators=50; total time=   3.7s
[CV] END ..learning_rate=0.01, max_depth=3, n_estimators=100; total time=   5.9s
[CV] END ..learning_rate=0.01, max_depth=3, n_estimators=100; total time=   5.6s
[CV] END ..learning_rate=0.01, max_depth=3, n_estimators=100; total time=   5.4s
[CV] END ..learning_rate=0.01, max_depth=3, n_estimators=200; total time=  10.7s
[CV] END ..learning_rate=0.01, max_depth=3, n_estimators=200; total time=  10.8s
[CV] END ..learning_rate=0.01, max_depth=3, n_estimators=200; total time=  10.8s
[CV] END ...learning_rate=0.01, max_depth=4, n_estimators=50; total time=   2.8s
[CV] END ...learning_rate=0.01, max_depth=4, n_estimators=50; total time=   2.8s
[CV] END ...learning_rate=0.01, max_depth=4, n_estimators=50; total time=   2.8s
[CV] END ..learning_rate=0.0

In [None]:
#XGBoost metrics
#Obtaining the best performing model from the grid search results
fi_xgb_best_model=fi_grid_search.best_estimator_

#prediction
fi_y_pred_xgb=fi_xgb_best_model.predict(fi_Xtest)

#metrics
fi_accuracy_xgb=accuracy_score(y_test, fi_y_pred_xgb)
fi_accuracy_xgb

0.9992982834934686

In [None]:
#Comparison for ex3
comparison=pd.DataFrame({'Accuracy':['Original','RFE','Feature Importance'],
                       'NN':[best_nn[2],rfe_best_nn[2],fi_best_nn[2]],
                       'SVM':[accuracy_svm,rfe_accuracy_svm,fi_accuracy_svm],
                       'XGB':[accuracy_xgb,rfe_accuracy_xgb,fi_accuracy_xgb]})
comparison

Unnamed: 0,Accuracy,NN,SVM,XGB
0,Original,0.99301,0.999474,0.999636
1,RFE,0.986438,0.996977,0.999298
2,Feature Importance,0.983577,0.996977,0.999298


# Exercise 4

In [15]:
from sklearn.cluster import KMeans, SpectralClustering, BisectingKMeans
from sklearn.metrics import adjusted_rand_score

#KMeans
kmeans=KMeans(n_clusters=23, random_state=1)
kmeans_labels=kmeans.fit_predict(X_train)

#Metrics - Adjusted Rand Index (ARI)
kmeans_ari=adjusted_rand_score(y_train,kmeans_labels)
kmeans_ari

0.8999002728452665

In [None]:
#spectral clustering
spectral=SpectralClustering(n_clusters=23,random_state=1)
spectral_labels=spectral.fit_predict(X_train)

#Metrics - Adjusted Rand Index (ARI)
spectral_ari=adjusted_rand_score(y_train,spectral_labels)
spectral_ari

In [None]:
#bisecting kmeans
bisect=BisectingKMeans(n_clusters=23,random_state=1)
bisect_labels=bisect.fit_predict(X_train)

#Metrics - Adjusted Rand Index (ARI)
bisect_ari=adjusted_rand_score(y_train,bisect_labels)
bisect_ari

*Attempts were made to use 3 different clustering algorithms but despite trying multiple, only kmeans would run without crashing.

The trained kmeans model gave an ARI of 0.8999, indicating that the proportion of agreements between the true clusters and the predicted clusters.

# Exercise 5

In [None]:
import matplotlib.pyplot as plt

#kmeans top 10%
kmeans_top=KMeans(n_clusters=23)
kmeans_top_labels=kmeans_top.fit_predict(top10_percent)

#kmeans bottom 10%
kmeans_bottom=KMeans(n_clusters=23)
kmeans_bottom_labels=kmeans_bottom.fit_predict(bottom10_percent)

#Visualize clusters
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.scatter(top10_percent[:, 0],top10_percent[:, 1],c=kmeans_top_labels)
plt.title("Top 10% Clusters")
plt.subplot(1, 2, 2)
plt.scatter(bottom10_percent[:, 0],bottom10_percent[:, 1],c=kmeans_bottom_labels)
plt.title("Bottom 10% Clusters")
plt.show()

# Analyze cluster characteristics
top_cluster_analysis=pd.Series(kmeans_top_labels).value_counts()
bottom_cluster_analysis=pd.Series(kmeans_bottom_labels).value_counts()

print("Top 10% Cluster Analysis:", top_cluster_analysis)
print("Bottom 10% Cluster Analysis:", bottom_cluster_analysis)


AttributeError: 'NoneType' object has no attribute 'split'

# Exercise 6

In [15]:
df

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,target
0,-0.090604,0.925753,-0.104067,0.514274,-0.856640,3.281686,-0.006673,-0.048834,-0.002571,-0.049029,...,-1.694315,0.599396,-0.282867,-1.022077,-0.175636,-0.464438,-0.463419,-0.25204,-0.249464,b'normal.'
1,-0.090604,0.925753,-0.104067,0.514274,-0.741563,0.079652,-0.006673,-0.048834,-0.002571,-0.049029,...,-1.600011,0.599396,-0.282867,-1.146737,-0.175636,-0.464438,-0.463419,-0.25204,-0.249464,b'normal.'
2,-0.090604,0.925753,-0.104067,0.514274,-0.749499,0.712180,-0.006673,-0.048834,-0.002571,-0.049029,...,-1.505707,0.599396,-0.282867,-1.188291,-0.175636,-0.464438,-0.463419,-0.25204,-0.249464,b'normal.'
3,-0.090604,0.925753,-0.104067,0.514274,-0.781245,0.712180,-0.006673,-0.048834,-0.002571,-0.049029,...,-1.411403,0.599396,-0.282867,-1.188291,-0.175636,-0.464438,-0.463419,-0.25204,-0.249464,b'normal.'
4,-0.090604,0.925753,-0.104067,0.514274,-0.785213,1.225040,-0.006673,-0.048834,-0.002571,-0.049029,...,-1.317100,0.599396,-0.282867,-1.209067,-0.175636,-0.464438,-0.463419,-0.25204,-0.249464,b'normal.'
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
494016,-0.090604,0.925753,-0.104067,0.514274,-0.600693,1.113549,-0.006673,-0.048834,-0.002571,-0.049029,...,0.625558,0.599396,-0.282867,-1.229844,1.196724,-0.464438,-0.426441,-0.25204,-0.249464,b'normal.'
494017,-0.090604,0.925753,-0.104067,0.514274,-0.656248,1.409373,-0.006673,-0.048834,-0.002571,-0.049029,...,0.625558,0.599396,-0.282867,-0.897417,1.196724,-0.464438,-0.426441,-0.25204,-0.249464,b'normal.'
494018,-0.090604,0.925753,-0.104067,0.514274,-0.812990,0.610351,-0.006673,-0.048834,-0.002571,-0.049029,...,0.625558,0.599396,-0.282867,-1.125961,1.196724,-0.305196,-0.426441,-0.25204,-0.249464,b'normal.'
494019,-0.090604,0.925753,-0.104067,0.514274,-0.638391,0.610351,-0.006673,-0.048834,-0.002571,-0.049029,...,0.625558,0.599396,-0.282867,-1.167514,1.196724,-0.358277,-0.426441,-0.25204,-0.249464,b'normal.'


In [16]:
#preparing SA dataset
og=pd.DataFrame(data=X,columns=D['feature_names'])
og['target']=y
normal=og.loc[og['target']==11] #11 = b'normal.'
normal['target'].unique() #shows only normal data taken

array([11])

In [17]:
len(normal)

97278

In [18]:
anomaly_num=len(normal)*0.01
anomaly_num

972.78

In [19]:
#adding anomaly data (1%)
anomaly=og.loc[og['target']!=11] #taking data that is not normal
anomaly=anomaly.sample(973)
anomaly

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,target
489777,-0.090604,-0.811550,-0.694982,0.514274,0.778245,-0.265972,-0.006673,-0.048834,-0.002571,-0.049029,...,0.625558,0.599396,-0.282867,0.827048,-0.175636,-0.464438,-0.463419,-0.25204,-0.249464,18
126343,-0.090604,0.925753,1.594815,-1.262832,-1.182030,-0.265972,-0.006673,-0.048834,-0.002571,-0.049029,...,-1.609441,-1.664586,0.266287,-1.250621,-0.175636,2.163044,2.162021,-0.25204,-0.249464,9
47877,-0.090604,-0.811550,-0.694982,0.514274,0.778245,-0.265972,-0.006673,-0.048834,-0.002571,-0.049029,...,0.625558,0.599396,-0.282867,0.827048,-0.175636,-0.464438,-0.463419,-0.25204,-0.249464,18
216136,-0.090604,-0.811550,-0.694982,0.514274,0.778245,-0.265972,-0.006673,-0.048834,-0.002571,-0.049029,...,0.625558,0.599396,-0.282867,0.827048,-0.175636,-0.464438,-0.463419,-0.25204,-0.249464,18
342144,-0.090604,-0.811550,-0.694982,0.514274,0.778245,-0.265972,-0.006673,-0.048834,-0.002571,-0.049029,...,0.625558,0.599396,-0.282867,0.827048,-0.175636,-0.464438,-0.463419,-0.25204,-0.249464,18
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70000,-0.090604,0.925753,1.594815,-1.262832,-1.182030,-0.265972,-0.006673,-0.048834,-0.002571,-0.049029,...,-1.637732,-1.688930,0.266287,-1.250621,-0.175636,2.163044,2.162021,-0.25204,-0.249464,9
68873,-0.090604,0.925753,1.594815,-1.262832,-1.182030,-0.265972,-0.006673,-0.048834,-0.002571,-0.049029,...,-1.750897,-1.810650,0.357813,-1.250621,-0.175636,2.163044,2.162021,-0.25204,-0.249464,9
274076,-0.090604,-0.811550,-0.694982,0.514274,0.778245,-0.265972,-0.006673,-0.048834,-0.002571,-0.049029,...,0.625558,0.599396,-0.282867,0.827048,-0.175636,-0.464438,-0.463419,-0.25204,-0.249464,18
333444,-0.090604,-0.811550,-0.694982,0.514274,0.778245,-0.265972,-0.006673,-0.048834,-0.002571,-0.049029,...,0.625558,0.599396,-0.282867,0.827048,-0.175636,-0.464438,-0.463419,-0.25204,-0.249464,18


In [20]:
#combining normal data and anomaly to make SA dataset
SA=pd.concat([normal,anomaly],ignore_index=True)
SA

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,target
0,-0.090604,0.925753,-0.104067,0.514274,-0.856640,3.281686,-0.006673,-0.048834,-0.002571,-0.049029,...,-1.694315,0.599396,-0.282867,-1.022077,-0.175636,-0.464438,-0.463419,-0.25204,-0.249464,11
1,-0.090604,0.925753,-0.104067,0.514274,-0.741563,0.079652,-0.006673,-0.048834,-0.002571,-0.049029,...,-1.600011,0.599396,-0.282867,-1.146737,-0.175636,-0.464438,-0.463419,-0.25204,-0.249464,11
2,-0.090604,0.925753,-0.104067,0.514274,-0.749499,0.712180,-0.006673,-0.048834,-0.002571,-0.049029,...,-1.505707,0.599396,-0.282867,-1.188291,-0.175636,-0.464438,-0.463419,-0.25204,-0.249464,11
3,-0.090604,0.925753,-0.104067,0.514274,-0.781245,0.712180,-0.006673,-0.048834,-0.002571,-0.049029,...,-1.411403,0.599396,-0.282867,-1.188291,-0.175636,-0.464438,-0.463419,-0.25204,-0.249464,11
4,-0.090604,0.925753,-0.104067,0.514274,-0.785213,1.225040,-0.006673,-0.048834,-0.002571,-0.049029,...,-1.317100,0.599396,-0.282867,-1.209067,-0.175636,-0.464438,-0.463419,-0.25204,-0.249464,11
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98246,-0.090604,0.925753,1.594815,-1.262832,-1.182030,-0.265972,-0.006673,-0.048834,-0.002571,-0.049029,...,-1.637732,-1.688930,0.266287,-1.250621,-0.175636,2.163044,2.162021,-0.25204,-0.249464,9
98247,-0.090604,0.925753,1.594815,-1.262832,-1.182030,-0.265972,-0.006673,-0.048834,-0.002571,-0.049029,...,-1.750897,-1.810650,0.357813,-1.250621,-0.175636,2.163044,2.162021,-0.25204,-0.249464,9
98248,-0.090604,-0.811550,-0.694982,0.514274,0.778245,-0.265972,-0.006673,-0.048834,-0.002571,-0.049029,...,0.625558,0.599396,-0.282867,0.827048,-0.175636,-0.464438,-0.463419,-0.25204,-0.249464,18
98249,-0.090604,-0.811550,-0.694982,0.514274,0.778245,-0.265972,-0.006673,-0.048834,-0.002571,-0.049029,...,0.625558,0.599396,-0.282867,0.827048,-0.175636,-0.464438,-0.463419,-0.25204,-0.249464,18


In [21]:
#change 'target' into binary classification, either normal=0 or anomaly=1
def SA_binary(val):
  if val==11:
    return 0
  else:
    return 1

SA['target']=SA['target'].apply(SA_binary)
SA

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,target
0,-0.090604,0.925753,-0.104067,0.514274,-0.856640,3.281686,-0.006673,-0.048834,-0.002571,-0.049029,...,-1.694315,0.599396,-0.282867,-1.022077,-0.175636,-0.464438,-0.463419,-0.25204,-0.249464,0
1,-0.090604,0.925753,-0.104067,0.514274,-0.741563,0.079652,-0.006673,-0.048834,-0.002571,-0.049029,...,-1.600011,0.599396,-0.282867,-1.146737,-0.175636,-0.464438,-0.463419,-0.25204,-0.249464,0
2,-0.090604,0.925753,-0.104067,0.514274,-0.749499,0.712180,-0.006673,-0.048834,-0.002571,-0.049029,...,-1.505707,0.599396,-0.282867,-1.188291,-0.175636,-0.464438,-0.463419,-0.25204,-0.249464,0
3,-0.090604,0.925753,-0.104067,0.514274,-0.781245,0.712180,-0.006673,-0.048834,-0.002571,-0.049029,...,-1.411403,0.599396,-0.282867,-1.188291,-0.175636,-0.464438,-0.463419,-0.25204,-0.249464,0
4,-0.090604,0.925753,-0.104067,0.514274,-0.785213,1.225040,-0.006673,-0.048834,-0.002571,-0.049029,...,-1.317100,0.599396,-0.282867,-1.209067,-0.175636,-0.464438,-0.463419,-0.25204,-0.249464,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98246,-0.090604,0.925753,1.594815,-1.262832,-1.182030,-0.265972,-0.006673,-0.048834,-0.002571,-0.049029,...,-1.637732,-1.688930,0.266287,-1.250621,-0.175636,2.163044,2.162021,-0.25204,-0.249464,1
98247,-0.090604,0.925753,1.594815,-1.262832,-1.182030,-0.265972,-0.006673,-0.048834,-0.002571,-0.049029,...,-1.750897,-1.810650,0.357813,-1.250621,-0.175636,2.163044,2.162021,-0.25204,-0.249464,1
98248,-0.090604,-0.811550,-0.694982,0.514274,0.778245,-0.265972,-0.006673,-0.048834,-0.002571,-0.049029,...,0.625558,0.599396,-0.282867,0.827048,-0.175636,-0.464438,-0.463419,-0.25204,-0.249464,1
98249,-0.090604,-0.811550,-0.694982,0.514274,0.778245,-0.265972,-0.006673,-0.048834,-0.002571,-0.049029,...,0.625558,0.599396,-0.282867,0.827048,-0.175636,-0.464438,-0.463419,-0.25204,-0.249464,1


In [22]:
#splitting the data
SA_X_train, SA_X_test, SA_y_train, SA_y_test = train_test_split(SA.iloc[:,:-1],SA.iloc[:,[-1]],test_size=0.3, random_state=1) #train = 70%
SA_X_val, SA_X_test, SA_y_val, SA_y_test = train_test_split(SA_X_test, SA_y_test, test_size=0.5, random_state=1) #test = 15%, validation = 15%
print("SA_X_train.shape", SA_X_train.shape, "SA_y_train.shape", SA_y_train.shape)
print("SA_X_test.shape", SA_X_test.shape, "SA_y_test.shape", SA_y_test.shape)
print("SA_X_val.shape", SA_X_test.shape, "SA_y_val.shape", SA_y_test.shape)

SA_X_train.shape (68775, 41) SA_y_train.shape (68775, 1)
SA_X_test.shape (14738, 41) SA_y_test.shape (14738, 1)
SA_X_val.shape (14738, 41) SA_y_val.shape (14738, 1)


In [23]:
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor

#isolation forest
iso_forest=IsolationForest(random_state=1)
iso_param_grid={
    'n_estimators': [200, 500, 1000],
    'max_samples': [1000, 10000, 30000],
    'contamination': [0.001, 0.01, 0.1,]
}
iso_grid_search=GridSearchCV(estimator=iso_forest, param_grid=iso_param_grid, cv=3, scoring='accuracy', verbose=3)
iso_grid_search.fit(SA_X_train,SA_y_train)
iso_best_params=iso_grid_search.best_params_
print("Best Hyperparameters:", iso_best_params)

Fitting 3 folds for each of 27 candidates, totalling 81 fits
[CV 1/3] END contamination=0.001, max_samples=1000, n_estimators=200;, score=0.010 total time=   1.5s
[CV 2/3] END contamination=0.001, max_samples=1000, n_estimators=200;, score=0.008 total time=   1.9s
[CV 3/3] END contamination=0.001, max_samples=1000, n_estimators=200;, score=0.010 total time=   2.3s
[CV 1/3] END contamination=0.001, max_samples=1000, n_estimators=500;, score=0.009 total time=   3.8s
[CV 2/3] END contamination=0.001, max_samples=1000, n_estimators=500;, score=0.008 total time=   3.6s
[CV 3/3] END contamination=0.001, max_samples=1000, n_estimators=500;, score=0.010 total time=   3.8s
[CV 1/3] END contamination=0.001, max_samples=1000, n_estimators=1000;, score=0.009 total time=   8.5s
[CV 2/3] END contamination=0.001, max_samples=1000, n_estimators=1000;, score=0.008 total time=   8.7s
[CV 3/3] END contamination=0.001, max_samples=1000, n_estimators=1000;, score=0.010 total time=   7.1s
[CV 1/3] END conta

In [24]:
#Obtaining the best performing model from the grid search results
iso_best_model=iso_grid_search.best_estimator_

#prediction
iso_y_pred=iso_best_model.predict(SA_X_test)

#metrics
iso_accuracy=accuracy_score(SA_y_test, iso_y_pred)
iso_accuracy

0.009499253630071923

In [25]:
#one-class svm
onec_svm=OneClassSVM()
onec_svm_param_grid={
    'nu': [0.001, 0.01, 0.1],
    'kernel':['linear','rbf','sigmoid']
}
onec_svm_grid_search=GridSearchCV(estimator=onec_svm, param_grid=onec_svm_param_grid, cv=3, scoring='accuracy', verbose=3)
onec_svm_grid_search.fit(SA_X_train,SA_y_train)
onec_svm_best_params=onec_svm_grid_search.best_params_
print("Best Hyperparameters:", onec_svm_best_params)

Fitting 3 folds for each of 9 candidates, totalling 27 fits
[CV 1/3] END ...........kernel=linear, nu=0.001;, score=0.000 total time=   1.2s
[CV 2/3] END ...........kernel=linear, nu=0.001;, score=0.002 total time=   1.6s
[CV 3/3] END ...........kernel=linear, nu=0.001;, score=0.000 total time=   1.7s
[CV 1/3] END ............kernel=linear, nu=0.01;, score=0.001 total time=  22.6s
[CV 2/3] END ............kernel=linear, nu=0.01;, score=0.002 total time=  28.7s
[CV 3/3] END ............kernel=linear, nu=0.01;, score=0.001 total time= 1.3min
[CV 1/3] END .............kernel=linear, nu=0.1;, score=0.000 total time=  36.4s
[CV 2/3] END .............kernel=linear, nu=0.1;, score=0.000 total time=  37.6s
[CV 3/3] END .............kernel=linear, nu=0.1;, score=0.000 total time=  36.7s
[CV 1/3] END ..............kernel=rbf, nu=0.001;, score=0.010 total time=   2.1s
[CV 2/3] END ..............kernel=rbf, nu=0.001;, score=0.008 total time=   1.3s
[CV 3/3] END ..............kernel=rbf, nu=0.001;,

In [26]:
#Obtaining the best performing model from the grid search results
onec_svm_best_model=onec_svm_grid_search.best_estimator_

#prediction
onec_svm_y_pred=onec_svm_best_model.predict(SA_X_test)

#metrics
onec_svm_accuracy=accuracy_score(SA_y_test, onec_svm_y_pred)
onec_svm_accuracy

0.009634957253358665

In [None]:
#local outlier factor
local_outlier=LocalOutlierFactor(novelty=True)
local_param_grid={
    'contamination': [0.001, 0.01, 0.1,],
    'n_neighbors': [ 500, 1000, 10000],
    'algorithm': ['ball_tree','kd_tree']
}
local_grid_search=GridSearchCV(estimator=local_outlier, param_grid=local_param_grid, cv=3, scoring='accuracy', verbose=3)
local_grid_search.fit(SA_X_train,SA_y_train)
local_best_params=local_grid_search.best_params_
print("Best Hyperparameters:", local_best_params)

Fitting 3 folds for each of 18 candidates, totalling 54 fits


Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 971, in _score
    scores = scorer(estimator, X_test, y_test, **score_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_scorer.py", line 279, in __call__
    return self._score(partial(_cached_call, None), estimator, X, y_true, **_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_scorer.py", line 370, in _score
    response_method = _check_response_method(estimator, self._response_method)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py", line 2145, in _check_response_method
    raise AttributeError(
AttributeError: LocalOutlierFactor has none of the following attributes: predict.



[CV 1/3] END algorithm=ball_tree, contamination=0.001, n_neighbors=500;, score=nan total time= 3.3min


In [None]:
#Obtaining the best performing model from the grid search results
local_best_model=local_grid_search.best_estimator_

#prediction
local_y_pred=local_best_model.predict(SA_X_test)

#metrics
local_accuracy=accuracy_score(SA_y_test, local_y_pred)
local_accuracy

In [28]:
#comparison of all anomaly models
anomaly_models=pd.DataFrame({'':'Accuracy',
                             'Isolation Forest': [iso_accuracy],
                             'One-Class SVM': [onec_svm_accuracy],
                             'Local Outlier Factor': ['N/A']})
anomaly_models

Unnamed: 0,Unnamed: 1,Isolation Forest,One-Class SVM,Local Outlier Factor
0,Accuracy,0.009499,0.009635,


# Exercsie 7

In [32]:
#SA 250 datapoints; 247 normal, 3 anomaly
normal_250=normal.sample(247)
normal_250

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,target
41880,0.030849,0.925753,1.964137,0.514274,0.683009,-0.034813,-0.006673,-0.048834,-0.002571,-0.049029,...,0.012583,0.307269,-0.008290,-1.229844,0.098836,-0.464438,-0.463419,-0.252040,-0.249464,11
52901,-0.090604,0.925753,1.964137,0.514274,0.206829,-0.035556,-0.006673,-0.048834,-0.002571,-0.049029,...,-0.713556,-0.496079,-0.008290,-1.229844,0.373308,-0.464438,-0.463419,-0.252040,-0.249464,11
486424,-0.090604,0.925753,-0.104067,0.514274,-0.602677,-0.093532,-0.006673,-0.048834,-0.002571,-0.049029,...,0.625558,0.599396,-0.282867,-1.084407,0.373308,-0.464438,-0.463419,-0.252040,-0.249464,11
7595,-0.090604,0.925753,1.964137,0.514274,2.325831,-0.036299,-0.006673,-0.048834,-0.002571,-0.049029,...,-0.326910,0.502021,-0.191341,-1.229844,-0.175636,-0.464438,-0.463419,-0.252040,-0.249464,11
52613,-0.090604,0.925753,-0.104067,-3.039937,-1.182030,-0.265972,-0.006673,-0.048834,-0.002571,-0.049029,...,-0.845581,0.599396,-0.282867,-1.188291,1.471196,-0.464438,-0.463419,4.084676,4.095715,11
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19636,-0.090604,0.925753,-0.104067,0.514274,-0.920131,4.534107,-0.006673,-0.048834,-0.002571,-0.049029,...,0.361507,0.599396,-0.282867,-0.564990,3.118028,-0.464438,-0.463419,-0.252040,-0.249464,11
5450,-0.090604,0.925753,-0.104067,0.514274,-0.779261,2.085011,-0.006673,-0.048834,-0.002571,-0.049029,...,0.625558,0.599396,-0.282867,-1.167514,1.745668,-0.464438,-0.463419,-0.252040,-0.249464,11
345493,-0.090604,2.663056,1.225492,0.514274,-0.928067,-0.265972,-0.006673,-0.048834,-0.002571,-0.049029,...,-1.769758,-1.834994,4.659518,0.640057,-0.175636,-0.464438,-0.463419,-0.252040,-0.249464,11
22567,-0.090604,0.925753,1.964137,0.514274,1.952823,-0.030353,-0.006673,-0.048834,-0.002571,-0.049029,...,-0.770138,-0.106577,1.364595,-1.125961,0.373308,-0.464438,-0.463419,-0.252040,-0.249464,11


In [31]:
anomaly_250=anomaly.sample(3)
anomaly_250

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,target
174970,-0.090604,-0.81155,-0.694982,0.514274,0.778245,-0.265972,-0.006673,-0.048834,-0.002571,-0.049029,...,0.625558,0.599396,-0.282867,0.827048,-0.175636,-0.464438,-0.463419,-0.25204,-0.249464,18
269109,-0.090604,-0.81155,-0.694982,0.514274,0.778245,-0.265972,-0.006673,-0.048834,-0.002571,-0.049029,...,0.625558,0.599396,-0.282867,0.827048,-0.175636,-0.464438,-0.463419,-0.25204,-0.249464,18
59782,-0.090604,0.925753,1.594815,-1.262832,-1.18203,-0.265972,-0.006673,-0.048834,-0.002571,-0.049029,...,-1.675454,-1.737618,0.357813,-1.250621,-0.175636,2.163044,2.162021,-0.25204,-0.249464,9


In [33]:
#combining normal and anomaly data
SA_250=pd.concat([normal_250,anomaly_250],ignore_index=True)
SA_250

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,target
0,0.030849,0.925753,1.964137,0.514274,0.683009,-0.034813,-0.006673,-0.048834,-0.002571,-0.049029,...,0.012583,0.307269,-0.008290,-1.229844,0.098836,-0.464438,-0.463419,-0.252040,-0.249464,11
1,-0.090604,0.925753,1.964137,0.514274,0.206829,-0.035556,-0.006673,-0.048834,-0.002571,-0.049029,...,-0.713556,-0.496079,-0.008290,-1.229844,0.373308,-0.464438,-0.463419,-0.252040,-0.249464,11
2,-0.090604,0.925753,-0.104067,0.514274,-0.602677,-0.093532,-0.006673,-0.048834,-0.002571,-0.049029,...,0.625558,0.599396,-0.282867,-1.084407,0.373308,-0.464438,-0.463419,-0.252040,-0.249464,11
3,-0.090604,0.925753,1.964137,0.514274,2.325831,-0.036299,-0.006673,-0.048834,-0.002571,-0.049029,...,-0.326910,0.502021,-0.191341,-1.229844,-0.175636,-0.464438,-0.463419,-0.252040,-0.249464,11
4,-0.090604,0.925753,-0.104067,-3.039937,-1.182030,-0.265972,-0.006673,-0.048834,-0.002571,-0.049029,...,-0.845581,0.599396,-0.282867,-1.188291,1.471196,-0.464438,-0.463419,4.084676,4.095715,11
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
245,-0.090604,0.925753,1.964137,0.514274,1.952823,-0.030353,-0.006673,-0.048834,-0.002571,-0.049029,...,-0.770138,-0.106577,1.364595,-1.125961,0.373308,-0.464438,-0.463419,-0.252040,-0.249464,11
246,-0.090604,0.925753,-0.104067,0.514274,-0.922115,6.606361,-0.006673,-0.048834,-0.002571,-0.049029,...,0.625558,0.599396,-0.282867,-1.188291,0.098836,-0.464438,-0.463419,-0.252040,-0.249464,11
247,-0.090604,-0.811550,-0.694982,0.514274,0.778245,-0.265972,-0.006673,-0.048834,-0.002571,-0.049029,...,0.625558,0.599396,-0.282867,0.827048,-0.175636,-0.464438,-0.463419,-0.252040,-0.249464,18
248,-0.090604,-0.811550,-0.694982,0.514274,0.778245,-0.265972,-0.006673,-0.048834,-0.002571,-0.049029,...,0.625558,0.599396,-0.282867,0.827048,-0.175636,-0.464438,-0.463419,-0.252040,-0.249464,18


In [34]:
#changing to binary classification; normal=0, anomaly=1
SA_250['target']=SA_250['target'].apply(SA_binary)
SA_250

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,target
0,0.030849,0.925753,1.964137,0.514274,0.683009,-0.034813,-0.006673,-0.048834,-0.002571,-0.049029,...,0.012583,0.307269,-0.008290,-1.229844,0.098836,-0.464438,-0.463419,-0.252040,-0.249464,0
1,-0.090604,0.925753,1.964137,0.514274,0.206829,-0.035556,-0.006673,-0.048834,-0.002571,-0.049029,...,-0.713556,-0.496079,-0.008290,-1.229844,0.373308,-0.464438,-0.463419,-0.252040,-0.249464,0
2,-0.090604,0.925753,-0.104067,0.514274,-0.602677,-0.093532,-0.006673,-0.048834,-0.002571,-0.049029,...,0.625558,0.599396,-0.282867,-1.084407,0.373308,-0.464438,-0.463419,-0.252040,-0.249464,0
3,-0.090604,0.925753,1.964137,0.514274,2.325831,-0.036299,-0.006673,-0.048834,-0.002571,-0.049029,...,-0.326910,0.502021,-0.191341,-1.229844,-0.175636,-0.464438,-0.463419,-0.252040,-0.249464,0
4,-0.090604,0.925753,-0.104067,-3.039937,-1.182030,-0.265972,-0.006673,-0.048834,-0.002571,-0.049029,...,-0.845581,0.599396,-0.282867,-1.188291,1.471196,-0.464438,-0.463419,4.084676,4.095715,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
245,-0.090604,0.925753,1.964137,0.514274,1.952823,-0.030353,-0.006673,-0.048834,-0.002571,-0.049029,...,-0.770138,-0.106577,1.364595,-1.125961,0.373308,-0.464438,-0.463419,-0.252040,-0.249464,0
246,-0.090604,0.925753,-0.104067,0.514274,-0.922115,6.606361,-0.006673,-0.048834,-0.002571,-0.049029,...,0.625558,0.599396,-0.282867,-1.188291,0.098836,-0.464438,-0.463419,-0.252040,-0.249464,0
247,-0.090604,-0.811550,-0.694982,0.514274,0.778245,-0.265972,-0.006673,-0.048834,-0.002571,-0.049029,...,0.625558,0.599396,-0.282867,0.827048,-0.175636,-0.464438,-0.463419,-0.252040,-0.249464,1
248,-0.090604,-0.811550,-0.694982,0.514274,0.778245,-0.265972,-0.006673,-0.048834,-0.002571,-0.049029,...,0.625558,0.599396,-0.282867,0.827048,-0.175636,-0.464438,-0.463419,-0.252040,-0.249464,1


In [35]:
#splitting the data
SA250_X_train, SA250_X_test, SA250_y_train, SA250_y_test = train_test_split(SA_250.iloc[:,:-1],SA_250.iloc[:,[-1]],test_size=0.3, random_state=1) #train = 70%
print("SA250_X_train.shape", SA250_X_train.shape, "SA250_y_train.shape", SA250_y_train.shape)
print("SA250_X_test.shape", SA250_X_test.shape, "SA250_y_test.shape", SA250_y_test.shape)

SA250_X_train.shape (175, 41) SA250_y_train.shape (175, 1)
SA250_X_test.shape (75, 41) SA250_y_test.shape (75, 1)


In [40]:
SA250_X=SA_250.iloc[:,:-1]
SA250_X

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
0,0.030849,0.925753,1.964137,0.514274,0.683009,-0.034813,-0.006673,-0.048834,-0.002571,-0.049029,...,-1.598120,0.012583,0.307269,-0.008290,-1.229844,0.098836,-0.464438,-0.463419,-0.252040,-0.249464
1,-0.090604,0.925753,1.964137,0.514274,0.206829,-0.035556,-0.006673,-0.048834,-0.002571,-0.049029,...,-1.366443,-0.713556,-0.496079,-0.008290,-1.229844,0.373308,-0.464438,-0.463419,-0.252040,-0.249464
2,-0.090604,0.925753,-0.104067,0.514274,-0.602677,-0.093532,-0.006673,-0.048834,-0.002571,-0.049029,...,-3.389755,0.625558,0.599396,-0.282867,-1.084407,0.373308,-0.464438,-0.463419,-0.252040,-0.249464
3,-0.090604,0.925753,1.964137,0.514274,2.325831,-0.036299,-0.006673,-0.048834,-0.002571,-0.049029,...,-1.119321,-0.326910,0.502021,-0.191341,-1.229844,-0.175636,-0.464438,-0.463419,-0.252040,-0.249464
4,-0.090604,0.925753,-0.104067,-3.039937,-1.182030,-0.265972,-0.006673,-0.048834,-0.002571,-0.049029,...,-3.019072,-0.845581,0.599396,-0.282867,-1.188291,1.471196,-0.464438,-0.463419,4.084676,4.095715
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
245,-0.090604,0.925753,1.964137,0.514274,1.952823,-0.030353,-0.006673,-0.048834,-0.002571,-0.049029,...,-3.327975,-0.770138,-0.106577,1.364595,-1.125961,0.373308,-0.464438,-0.463419,-0.252040,-0.249464
246,-0.090604,0.925753,-0.104067,0.514274,-0.922115,6.606361,-0.006673,-0.048834,-0.002571,-0.049029,...,-3.065408,0.625558,0.599396,-0.282867,-1.188291,0.098836,-0.464438,-0.463419,-0.252040,-0.249464
247,-0.090604,-0.811550,-0.694982,0.514274,0.778245,-0.265972,-0.006673,-0.048834,-0.002571,-0.049029,...,0.347967,0.625558,0.599396,-0.282867,0.827048,-0.175636,-0.464438,-0.463419,-0.252040,-0.249464
248,-0.090604,-0.811550,-0.694982,0.514274,0.778245,-0.265972,-0.006673,-0.048834,-0.002571,-0.049029,...,0.347967,0.625558,0.599396,-0.282867,0.827048,-0.175636,-0.464438,-0.463419,-0.252040,-0.249464


In [41]:
SA250_y=SA_250.iloc[:,[-1]]
SA250_y

Unnamed: 0,target
0,0
1,0
2,0
3,0
4,0
...,...
245,0
246,0
247,1
248,1


In [45]:
#leave-one-out
from sklearn.model_selection import LeaveOneOut

loo=LeaveOneOut()
iso_250_predictions=[]
svm_250_predictions=[]

for train_index, test_index in loo.split(SA250_X):
    X_train, X_test=SA250_X.iloc[train_index], SA250_X.iloc[test_index]
    #isolation forest
    iso_forest.fit(X_train)
    iso_250_y_pred=iso_forest.predict(X_test)
    iso_250_predictions.append(np.where(iso_250_y_pred==-1, 1, 0))
    #one-class svm
    onec_svm.fit(X_train)
    svm_250_y_pred=onec_svm.predict(X_test)
    svm_250_predictions.append(np.where(svm_250_y_pred==-1, 1, 0))

iso250_accuracy=accuracy_score(SA250_y, iso_250_predictions)
svm250_accuracy=accuracy_score(SA250_y, svm_250_predictions)
print(f'Isolation Forest: {iso250_accuracy} \nOne-Class SVM: {svm250_accuracy}')

Isolation Forest: 0.916 
One-Class SVM: 0.504


In [47]:
#comparing ex6 and ex7
anomaly_comparisons=pd.DataFrame({'':['Isolation Forest','SVM','Local Outlier Factor'],
                                  'Original':[iso_accuracy, onec_svm_accuracy,'N/A'],
                                  'LOO':[iso250_accuracy,svm250_accuracy,'N/A']})
anomaly_comparisons

Unnamed: 0,Unnamed: 1,Original,LOO
0,Isolation Forest,0.009499,0.916
1,SVM,0.009635,0.504
2,Local Outlier Factor,,


# Exercise 8

In [50]:
#feature selection; 5 most important features

from sklearn.ensemble import RandomForestClassifier

#feature selection with Random Forest
rf=RandomForestClassifier()
rf.fit(SA250_X, SA250_y)
importances=rf.feature_importances_
indices=np.argsort(importances)[-5:]  #indices of the top 5 features

#reduced features dataset
X_selected=SA250_X.iloc[:,indices]

#models

iso_selected_predictions=[]
svm_selected_predictions=[]

for train_index, test_index in loo.split(X_selected):
    X_train, X_test=X_selected.iloc[train_index], X_selected.iloc[test_index]
    #isolation forest
    iso_forest.fit(X_train)
    iso_selected_y_pred=iso_forest.predict(X_test)
    iso_selected_predictions.append(np.where(iso_selected_y_pred==-1, 1, 0))
    #one-class svm
    onec_svm.fit(X_train)
    svm_selected_y_pred=onec_svm.predict(X_test)
    svm_selected_predictions.append(np.where(svm_selected_y_pred==-1, 1, 0))

iso_selected_accuracy=accuracy_score(SA250_y, iso_selected_predictions)
svm_selected_accuracy=accuracy_score(SA250_y, svm_selected_predictions)
print(f'Isolation Forest: {iso_selected_accuracy} \nOne-Class SVM: {svm_selected_accuracy}')


  return fit_method(estimator, *args, **kwargs)


Isolation Forest: 0.856 
One-Class SVM: 0.508


In [51]:
#comparison between ex6-8
ex6_8=pd.DataFrame({'':['Isolation Forest','SVM','Local Outlier Factor'],
                    'Original':[iso_accuracy, onec_svm_accuracy,'N/A'],
                    'LOO':[iso250_accuracy,svm250_accuracy,'N/A'],
                    'Feat Selection':[iso_selected_accuracy,svm_selected_accuracy,'N/A']
                    })
ex6_8

Unnamed: 0,Unnamed: 1,Original,LOO,Feat Selection
0,Isolation Forest,0.009499,0.916,0.856
1,SVM,0.009635,0.504,0.508
2,Local Outlier Factor,,,


Feature selection improved the one-class SVM model but did not improve the Isolation Forest model, as seen by accuracy scores above.