# Lab 6

Scikit-learn provides a wide variety of algorithms for common machine learning tasks, such as:

* Classification
* Regression
* Clustering
* Feature Selection
* Anomaly Detection

It also provides some datasets that you can use to test these algorithms:

* Classification Datasets:
    * Breast cancer wisconsin
    * Iris plants (3-classes)
    * Optical recognition of handwritten digits (10-classes)
    * Wine (3-classes)
    * Forest covertypes (7-classes)

* Regression Datasets:
    * Boston house prices (removed in scikit-learn 1.2)
    * Diabetes
    * Linnerrud (multiple regression)
    * California Housing

* Image:
    * The Olivetti faces
    * The Labeled Faces in the Wild face recognition

* NLP:
    * The 20 newsgroups text dataset
    * Reuters Corpus Volume I 

* Other:
    * Kddcup 99 - Intrusion Detection
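
The small bundled datasets load without any download, which makes them handy for quick experiments before moving to Kddcup 99. A minimal sketch, assuming a standard scikit-learn install; the `loaders` and `summaries` names are only illustrative:

```python
from sklearn.datasets import load_breast_cancer, load_digits, load_iris, load_wine

# Each loader returns a Bunch with .data (features) and .target (labels).
loaders = {
    "iris": load_iris,
    "wine": load_wine,
    "digits": load_digits,
    "breast_cancer": load_breast_cancer,
}

summaries = {}  # illustrative: name -> (n_samples, n_features, n_classes)
for name, loader in loaders.items():
    bunch = loader()
    n_samples, n_features = bunch.data.shape
    n_classes = len(set(bunch.target))
    summaries[name] = (n_samples, n_features, n_classes)
    print(f"{name}: {n_samples} samples, {n_features} features, {n_classes} classes")
```

The larger datasets (faces, newsgroups, Kddcup 99) use `fetch_*` functions instead and download the data on first use.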

## Exercises

1. Use the full [Kddcup](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) dataset to compare classification performance of 3 different classifiers. 
    * Separate the data into train, validation, and test. 
    * Use accuracy as the metric for assessing performance. 
    * For each classifier, identify the hyperparameters. Perform optimization over at least 2 hyperparameters.   
    * Compare the performance of the optimal configuration of the classifiers.

2. Pick the best algorithm in question 1. Create an ensemble of at least 25 models, and use them for the classification task. Identify the top and bottom 10% of the data in terms of uncertainty of the decision.

3. Use 2 different feature selection algorithms to identify the 10 most important features for the task in question 1. Retrain the classifiers from question 1 with just this subset of features and compare performance.

4. Use the same data, removing the labels, and compare performance of 3 different clustering algorithms. Can you find clusters for each of the classes in question 1? 

5. Can you identify any clusters within the top/bottom 10% identified in question 2? What are their characteristics?

6. Use the "SA" dataset to compare the performance of 3 different anomaly detection algorithms.

7. Create a subsample of 250 datapoints, redo question 6, using Leave-one-out as the method of evaluation.

8. Use a feature selection algorithm to identify the 5 most important features for the task in question 6, for each algorithm. Does anomaly detection improve with fewer features?
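
For the anomaly detection comparison in question 6, one possible setup is sketched below. It is not a solution for the "SA" data: it uses a small synthetic stand-in so that it runs offline, and the choice of detectors, their settings, and ROC AUC as the metric are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)

# Synthetic stand-in (assumption): a dense "normal" cloud plus a few scattered outliers.
X_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
X_outlier = rng.uniform(low=-8.0, high=8.0, size=(25, 2))
X_demo = np.vstack([X_normal, X_outlier])
y_demo = np.r_[np.zeros(500, dtype=int), np.ones(25, dtype=int)]  # 1 = anomaly

detectors = {
    "IsolationForest": IsolationForest(random_state=0),
    "OneClassSVM": OneClassSVM(nu=0.05, gamma="scale"),
    "LocalOutlierFactor": LocalOutlierFactor(n_neighbors=20),
}

auc = {}
for name, det in detectors.items():
    det.fit(X_demo)
    if name == "LocalOutlierFactor":
        # LOF in its default (outlier detection) mode exposes scores for the fitted data only
        scores = -det.negative_outlier_factor_
    else:
        # decision_function is lower for anomalies, so negate: higher = more anomalous
        scores = -det.decision_function(X_demo)
    auc[name] = roc_auc_score(y_demo, scores)
    print(f"{name}: ROC AUC = {auc[name]:.3f}")
```

A ranking metric such as ROC AUC is used here because plain accuracy is dominated by the majority class when anomalies are rare.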

## Quick look at the data

In [7]:
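from sklearn.datasets import fetch_kddcup99
# with the default percent10=True this loads the 10% subset (494,021 rows);
# pass percent10=False to fetch the full dataset
D=fetch_kddcup99()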

In [9]:
dir(D)

['DESCR', 'data', 'feature_names', 'frame', 'target', 'target_names']

In [11]:
print(D["DESCR"])

.. _kddcup99_dataset:

Kddcup 99 dataset
-----------------

The KDD Cup '99 dataset was created by processing the tcpdump portions
of the 1998 DARPA Intrusion Detection System (IDS) Evaluation dataset,
created by MIT Lincoln Lab [2]_. The artificial data (described on the `dataset's
homepage <https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html>`_) was
generated using a closed network and hand-injected attacks to produce a
large number of different types of attack with normal activity in the
background. As the initial goal was to produce a large training set for
supervised learning algorithms, there is a large proportion (80.1%) of
abnormal data which is unrealistic in real world, and inappropriate for
unsupervised anomaly detection which aims at detecting 'abnormal' data, i.e.:

* qualitatively different from normal data
* in large minority among the observations.

We thus transform the KDD Data set into two different data sets: SA and SF.

* SA is obtained by simply selecting all

In [15]:
import numpy as np
np.unique(D["target"])

array([b'back.', b'buffer_overflow.', b'ftp_write.', b'guess_passwd.',
       b'imap.', b'ipsweep.', b'land.', b'loadmodule.', b'multihop.',
       b'neptune.', b'nmap.', b'normal.', b'perl.', b'phf.', b'pod.',
       b'portsweep.', b'rootkit.', b'satan.', b'smurf.', b'spy.',
       b'teardrop.', b'warezclient.', b'warezmaster.'], dtype=object)

In [17]:
len(np.unique(D["target"]))

23

In [19]:
D["feature_names"]

['duration',
 'protocol_type',
 'service',
 'flag',
 'src_bytes',
 'dst_bytes',
 'land',
 'wrong_fragment',
 'urgent',
 'hot',
 'num_failed_logins',
 'logged_in',
 'num_compromised',
 'root_shell',
 'su_attempted',
 'num_root',
 'num_file_creations',
 'num_shells',
 'num_access_files',
 'num_outbound_cmds',
 'is_host_login',
 'is_guest_login',
 'count',
 'srv_count',
 'serror_rate',
 'srv_serror_rate',
 'rerror_rate',
 'srv_rerror_rate',
 'same_srv_rate',
 'diff_srv_rate',
 'srv_diff_host_rate',
 'dst_host_count',
 'dst_host_srv_count',
 'dst_host_same_srv_rate',
 'dst_host_diff_srv_rate',
 'dst_host_same_src_port_rate',
 'dst_host_srv_diff_host_rate',
 'dst_host_serror_rate',
 'dst_host_srv_serror_rate',
 'dst_host_rerror_rate',
 'dst_host_srv_rerror_rate']

****Exercise 1****

In [1]:

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Dense, Input
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from tensorflow.keras.optimizers import Adam
import xgboost as xgb

In [21]:
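#defining X
X=pd.DataFrame(data=D['data'],columns=D['feature_names'])

#encoding
#note: the array returned by fetch_kddcup99 has object dtype, so this selects
#every column (numeric ones included) and replaces its values with integer codes
convert=X.select_dtypes(include=['object']).columns.tolist()

for col in convert:
    le = LabelEncoder() #fresh encoder for each column
    X[col] = le.fit_transform(X[col])

X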

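| | duration | protocol_type | service | flag | src_bytes | dst_bytes | land | wrong_fragment | urgent | hot | ... | dst_host_count | dst_host_srv_count | dst_host_same_srv_rate | dst_host_diff_srv_rate | dst_host_same_src_port_rate | dst_host_srv_diff_host_rate | dst_host_serror_rate | dst_host_srv_serror_rate | dst_host_rerror_rate | dst_host_srv_rerror_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 22 | 9 | 164 | 4773 | 0 | 0 | 0 | 0 | ... | 9 | 9 | 100 | 0 | 11 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 22 | 9 | 222 | 465 | 0 | 0 | 0 | 0 | ... | 19 | 19 | 100 | 0 | 5 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 22 | 9 | 218 | 1316 | 0 | 0 | 0 | 0 | ... | 29 | 29 | 100 | 0 | 3 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 1 | 22 | 9 | 202 | 1316 | 0 | 0 | 0 | 0 | ... | 39 | 39 | 100 | 0 | 3 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 1 | 22 | 9 | 200 | 2006 | 0 | 0 | 0 | 0 | ... | 49 | 49 | 100 | 0 | 2 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 494016 | 0 | 1 | 22 | 9 | 293 | 1856 | 0 | 0 | 0 | 0 | ... | 86 | 255 | 100 | 0 | 1 | 5 | 0 | 1 | 0 | 0 |
| 494017 | 0 | 1 | 22 | 9 | 265 | 2254 | 0 | 0 | 0 | 0 | ... | 6 | 255 | 100 | 0 | 17 | 5 | 0 | 1 | 0 | 0 |
| 494018 | 0 | 1 | 22 | 9 | 186 | 1179 | 0 | 0 | 0 | 0 | ... | 16 | 255 | 100 | 0 | 6 | 5 | 6 | 1 | 0 | 0 |
| 494019 | 0 | 1 | 22 | 9 | 274 | 1179 | 0 | 0 | 0 | 0 | ... | 26 | 255 | 100 | 0 | 4 | 5 | 4 | 1 | 0 | 0 |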


In [23]:
#defining y
y=D['target']

#encoding
encoder=LabelEncoder()
y=encoder.fit_transform(y)
y

array([11, 11, 11, ..., 11, 11, 11])

In [25]:
#checking dataset
df=pd.DataFrame(data=X,columns=D['feature_names'])
df['target']=D['target']
df.describe(include='all')

| | duration | protocol_type | service | flag | src_bytes | dst_bytes | land | wrong_fragment | urgent | hot | ... | dst_host_srv_count | dst_host_same_srv_rate | dst_host_diff_srv_rate | dst_host_same_src_port_rate | dst_host_srv_diff_host_rate | dst_host_serror_rate | dst_host_srv_serror_rate | dst_host_rerror_rate | dst_host_srv_rerror_rate | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 494021.0 | 494021.0 | 494021.0 | 494021.0 | 494021.0 | 494021.0 | 494021.0 | 494021.0 | 494021.0 | 494021.0 | ... | 494021.0 | 494021.0 | 494021.0 | 494021.0 | 494021.0 | 494021.0 | 494021.0 | 494021.0 | 494021.0 | 494021 |
| unique | | | | | | | | | | | ... | | | | | | | | | | 23 |
| top | | | | | | | | | | | ... | | | | | | | | | | b'smurf.' |
| freq | | | | | | | | | | | ... | | | | | | | | | | 280790 |
| mean | 11.936043 | 0.467132 | 23.408894 | 7.842446 | 595.755899 | 357.837126 | 4.5e-05 | 0.004469 | 1.4e-05 | 0.028673 | ... | 188.66567 | 75.37797 | 3.090573 | 60.193476 | 0.639906 | 17.49939 | 12.532269 | 5.811761 | 5.741167 | |
| std | 131.739314 | 0.575606 | 13.538332 | 2.250853 | 504.011413 | 1345.396019 | 0.006673 | 0.091523 | 0.00551 | 0.584815 | ... | 106.040437 | 41.078098 | 10.925911 | 48.130925 | 3.643362 | 37.678711 | 27.043113 | 23.058951 | 23.014032 | |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 25% | 0.0 | 0.0 | 14.0 | 9.0 | 39.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 46.0 | 41.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 50% | 0.0 | 0.0 | 14.0 | 9.0 | 491.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 255.0 | 100.0 | 0.0 | 100.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 75% | 0.0 | 1.0 | 42.0 | 9.0 | 988.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 255.0 | 100.0 | 4.0 | 100.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |


In [27]:
#splitting the data
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=1) #train = 70%
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=1) #test = 15%, validation = 15%
print("X_train.shape", X_train.shape, "y_train.shape", y_train.shape)
print("X_test.shape", X_test.shape, "y_test.shape", y_test.shape)
print("X_val.shape", X_val.shape, "y_val.shape", y_val.shape)

X_train.shape (345814, 41) y_train.shape (345814,)
X_test.shape (74104, 41) y_test.shape (74104,)
X_val.shape (74103, 41) y_val.shape (74103,)


In [29]:
#standardizing
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

In [None]:
#function to create a NN model
def create_nn(reg=tf.keras.regularizers.l2(0.1), learning_rate=0.001):
    model=keras.Sequential(
        [
            Input(shape=(X_train.shape[1],)),
            Dense(100, activation='relu',kernel_regularizer=reg),
            Dense(50, activation='relu',kernel_regularizer=reg),
            #one output unit per encoded class, even if a rare class is missing from y_train
            Dense(len(encoder.classes_), activation='softmax')
        ]
    )
    model.compile(optimizer=Adam(learning_rate=learning_rate), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model

#hyperparameter tuning for NN
nn_results=[]
for reg in [tf.keras.regularizers.l1(0.1),tf.keras.regularizers.l2(0.1)]: #hyperparameter 1: regularization
    for learning_rate in [0.001,0.01]: #hyperparameter 2: learning rate
        model = create_nn(reg, learning_rate)
        model.fit(X_train, y_train, epochs=10, verbose=0)
        loss, accuracy = model.evaluate(X_val, y_val, verbose=0)
        nn_results.append((reg, learning_rate, accuracy))

#finding best model (highest accuracy)
best_nn=max(nn_results, key=lambda x: x[2])
best_nn
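
The same validation-set loop carries over to scikit-learn classifiers, which is one way to cover the remaining classifiers for this exercise. A minimal sketch, assuming a random forest and a small illustrative grid; it uses the bundled digits data as a stand-in so that it runs quickly, and the names `X_demo`, `rf_results`, and `final` are hypothetical:

```python
from itertools import product

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): digits instead of Kddcup 99.
X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=1, stratify=y_tmp)

# Grid over two hyperparameters, scored on the validation split only.
rf_results = []
for n_estimators, max_depth in product([50, 100], [5, None]):
    clf = RandomForestClassifier(
        n_estimators=n_estimators, max_depth=max_depth, random_state=0)
    clf.fit(X_tr, y_tr)
    val_acc = accuracy_score(y_va, clf.predict(X_va))
    rf_results.append((n_estimators, max_depth, val_acc))

# Pick the configuration with the highest validation accuracy,
# then report accuracy on the untouched test split once.
best_n, best_depth, best_val_acc = max(rf_results, key=lambda r: r[2])
final = RandomForestClassifier(
    n_estimators=best_n, max_depth=best_depth, random_state=0).fit(X_tr, y_tr)
test_acc = accuracy_score(y_te, final.predict(X_te))
print("best config:", (best_n, best_depth), "val acc:", best_val_acc, "test acc:", test_acc)
```

Keeping the test split out of the selection loop means the final number is not biased by the hyperparameter search.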

****Exercise 2****

In [None]:
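from sklearn.ensemble import RandomForestClassifier

ensemble_size = 25
ensemble = []
for seed in range(ensemble_size):
    #a different seed per model keeps the members distinct and the run reproducible
    rfc = RandomForestClassifier(random_state=seed)
    rfc.fit(X_train,y_train)
    ensemble.append(rfc)

#predictions[i, m] = class predicted for test point i by model m
predictions = np.zeros((X_test.shape[0],ensemble_size), dtype=int)
for m, rfc in enumerate(ensemble):
    predictions[:,m] = rfc.predict(X_test)

#class codes are nominal, so their standard deviation is not a meaningful spread;
#use the fraction of models that disagree with the majority vote instead
n_classes = predictions.max() + 1
votes = np.apply_along_axis(np.bincount, 1, predictions, minlength=n_classes)
uncertainty = 1.0 - votes.max(axis=1) / ensemble_size

#sort once; ties (e.g. many points with zero disagreement) are ordered arbitrarily
num_points = X_test.shape[0]
k = int(0.1 * num_points)
order = np.argsort(uncertainty)
top_10_percent_indices = order[-k:]
bottom_10_percent_indices = order[:k]


print("Top 10% of the data points in terms of uncertainty:")
print(top_10_percent_indices)

print("Bottom 10% of the data points in terms of uncertainty:")
print(bottom_10_percent_indices)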
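
Another common way to score ensemble uncertainty is the entropy of the class probabilities averaged over the members. The sketch below is only an illustration under stated assumptions: it uses the bundled wine data as a small stand-in, deliberately small forests, and hypothetical names (`members`, `mean_proba`, `entropy`).

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): wine instead of Kddcup 99.
X_demo, y_demo = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0, stratify=y_demo)

# 25 small forests that differ only in their random seed.
members = [
    RandomForestClassifier(n_estimators=10, random_state=seed).fit(X_tr, y_tr)
    for seed in range(25)
]

# Average the per-class probabilities over the ensemble: shape (n_test, n_classes).
# Every member is fit on the same labels, so the probability columns line up.
mean_proba = np.mean([m.predict_proba(X_te) for m in members], axis=0)

# Predictive entropy: 0 when all probability mass is on one class,
# log(n_classes) when the ensemble is maximally undecided.
eps = 1e-12  # avoids log(0)
entropy = -np.sum(mean_proba * np.log(mean_proba + eps), axis=1)

k = max(1, int(0.1 * len(entropy)))
order = np.argsort(entropy)
least_uncertain = order[:k]   # bottom 10% by entropy
most_uncertain = order[-k:]   # top 10% by entropy
print("most uncertain test indices:", most_uncertain)
```

Unlike a score based on hard votes alone, this also uses how confident each member is, so it tends to produce fewer ties.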

****Exercise 3****

Using recursive feature elimination (RFE) and feature importances (FI) from random forests

In [None]:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.metrics import classification_report

# RFE
model_rfe = RandomForestClassifier(random_state=42)
rfe = RFE(estimator=model_rfe, n_features_to_select=10)
rfe.fit(X_train, y_train)
rfe_features = np.array(X.columns)[rfe.support_]

# rf feature importance
model_rf = RandomForestClassifier(random_state=42)
model_rf.fit(X_train, y_train)
importances = model_rf.feature_importances_
indices = np.argsort(importances)[::-1][:10]
rf_features = np.array(X.columns)[indices]

In [None]:
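# combine selected features
# (work with column positions: X_train/X_test are NumPy arrays after scaling,
# so they cannot be indexed by column name)
rfe_idx = np.flatnonzero(rfe.support_)
rf_idx = indices
combined_idx = np.union1d(rfe_idx, rf_idx)
combined_features = np.array(X.columns)[combined_idx]

# train classifiers w/ selected features
X_train_rfe = X_train[:, rfe_idx]
X_test_rfe = X_test[:, rfe_idx]

X_train_rf = X_train[:, rf_idx]
X_test_rf = X_test[:, rf_idx]

X_train_combined = X_train[:, combined_idx]
X_test_combined = X_test[:, combined_idx]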

In [None]:
#RFE
model_rfe_final = RandomForestClassifier(random_state=42)
model_rfe_final.fit(X_train_rfe, y_train)
y_pred_rfe = model_rfe_final.predict(X_test_rfe)
print("\n\nFeatures selected by RFE:", rfe_features)
print("\nPerformance using RFE selected features:")
print(classification_report(y_test, y_pred_rfe))
print("\nAccuracy:", accuracy_score(y_test, y_pred_rfe))


# feature importance
model_rf_final = RandomForestClassifier(random_state=42)
model_rf_final.fit(X_train_rf, y_train)
y_pred_rf = model_rf_final.predict(X_test_rf)
print("\n\nFeatures selected by Random Forest:", rf_features)
print("\nPerformance using Random Forest feature importance selected features:")
print(classification_report(y_test, y_pred_rf))
print("\nAccuracy:", accuracy_score(y_test, y_pred_rf))

# combined 
model_combined_final = RandomForestClassifier(random_state=42)
model_combined_final.fit(X_train_combined, y_train)
y_pred_combined = model_combined_final.predict(X_test_combined)
print("\n\nCombined selected features:", combined_features)
print("\nPerformance using combined features from RFE and Random Forest feature importance:")
print(classification_report(y_test, y_pred_combined))
print("\nAccuracy:", accuracy_score(y_test, y_pred_combined))
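
RFE with a random forest and random-forest importances both rank features with the same model family, so their selections may overlap heavily. A model-free filter can serve as a more independent second method. The sketch below is one possibility, assuming mutual information as the score; it uses the bundled breast cancer data as a small stand-in, and names such as `selector` and `mask` are hypothetical.

```python
from functools import partial

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): breast cancer instead of Kddcup 99.
data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0, stratify=data.target)

# Keep the 10 features with the highest mutual information with the label.
# partial(...) fixes the seed of the estimator so the ranking is reproducible.
selector = SelectKBest(score_func=partial(mutual_info_classif, random_state=0), k=10)
selector.fit(X_tr, y_tr)
mask = selector.get_support()            # boolean mask over the columns
selected_names = data.feature_names[mask]

# Retrain on all features and on the selected subset, then compare accuracy.
full = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
reduced = RandomForestClassifier(random_state=0).fit(X_tr[:, mask], y_tr)
acc_full = accuracy_score(y_te, full.predict(X_te))
acc_reduced = accuracy_score(y_te, reduced.predict(X_te[:, mask]))

print("selected:", list(selected_names))
print("accuracy, all features:", acc_full)
print("accuracy, 10 features: ", acc_reduced)
```

The selector is fit on the training split only, so the test split plays no part in choosing the features.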

****Exercise 4****