# FEATURE SELECTION USING MUTUAL INFORMATION

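Mutual information (MI) is a common choice for feature selection ahead of KNN modeling because it can capture both linear and nonlinear dependence between a feature and the target. It makes no assumption about the data distribution and ranks features by how much information they carry about the class label, which suits datasets with complex relationships. Note, however, that the code in this section passes f_classif (the ANOVA F-test, which only measures how well class means are separated) to SelectKBest. To rank features by mutual information, pass mutual_info_classif from sklearn.feature_selection as the score function instead.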
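The difference between the two score functions can be seen on synthetic data. The following is a minimal sketch, not part of the original notebook: the arrays and variable names are made up for illustration, and the target depends on the absolute value of the first feature, so the class means of that feature coincide and the F-test has little to work with.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

# Synthetic data (an assumption for illustration, not the notebook's dataset)
rng = np.random.default_rng(0)
n_samples = 2000
x_nonlinear = rng.uniform(-1, 1, n_samples)  # target depends on |x|, not on its sign
x_noise = rng.normal(size=n_samples)         # unrelated to the target
y_synth = (np.abs(x_nonlinear) > 0.5).astype(int)
X_synth = np.column_stack([x_nonlinear, x_noise])

# F-test: compares class means, so a symmetric dependence is largely invisible to it
f_scores, _ = f_classif(X_synth, y_synth)

# Mutual information: nonparametric estimate that can pick up the |x| dependence
mi_scores = mutual_info_classif(X_synth, y_synth, random_state=0)

print("F-test scores:", f_scores)
print("MI scores:    ", mi_scores)
```

In this setup the MI score of the first feature should be clearly larger than that of the noise feature, whereas the F-test gives no comparable signal.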

Loading the Dataset: The code begins by importing the necessary libraries and loading the dataset from a CSV file using Pandas.

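Defining Features and Target: It defines the features (X) and the target variable (y). The 'id' and 'label' columns are dropped from X because they are not features. Any remaining column that is not listed in numerical_cols or categorical_cols (for example 'attack_cat') is discarded later by the ColumnTransformer, whose default setting is remainder='drop'.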

Identifying Numerical and Categorical Columns:

- Numerical columns are those that contain continuous or discrete numeric values.
- Categorical columns are those that contain categorical or text-based values.

Preprocessing for Numerical Features:

- It sets up a preprocessing pipeline for numerical features.
- The pipeline includes:
  - SimpleImputer: Imputes missing values with the median of each numerical column.
  - StandardScaler: Standardizes the numerical features to have a mean of 0 and a standard deviation of 1.

Preprocessing for Categorical Features:

- It sets up a preprocessing pipeline for categorical features.
- The pipeline includes:
  - SimpleImputer: Imputes missing values with the most frequent value in each categorical column.
  - OneHotEncoder: Performs one-hot encoding to convert categorical values into binary (0 or 1) columns.

Combining Preprocessing Steps:

- It uses the ColumnTransformer to combine the preprocessing steps for both numerical and categorical columns.
- This ensures that all features are processed correctly before feature selection.

Feature Selection:

- It specifies the number of features to select (num_features_to_select).
- It caps that number so that it does not exceed the total number of available features.
- It uses SelectKBest with the f_classif score function to score the features and keep the top ones.
- fit_transform is called on the preprocessor, and the selector is then fitted on the preprocessed data and the target.
- It then retrieves a boolean mask of the selected features using selector.get_support().

Getting Selected Feature Names:

- It constructs a list of all feature names, including numerical and one-hot encoded categorical features.
- It filters this list to include only the names of the selected features, based on the mask obtained in the previous step.

Displaying Selected Features:

- Finally, it prints the names of the selected features to the console.

This code prepares the dataset by preprocessing both numerical and categorical features, selects the top K features by their f_classif (ANOVA F-test) scores, and displays the names of the selected features.

In [3]:
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Load your dataset (assuming it's in a CSV file)
data = pd.read_csv('C:/Users/shaik/Downloads/USNW (1).csv')
data

# Define features (X) and target (y)
X = data.drop(['id', 'label'], axis=1)  # Exclude 'id' and 'label' columns
y = data['label']

# Define numerical and categorical columns
numerical_cols = ['dur', 'spkts', 'dpkts', 'sbytes', 'dbytes', 'rate', 'sttl', 'dttl', 'sload', 'dload',
                  'sloss', 'dloss', 'sinpkt', 'dinpkt', 'sjit', 'djit', 'swin', 'stcpb', 'dtcpb', 'dwin',
                  'tcprtt', 'synack', 'ackdat', 'smean', 'dmean', 'trans_depth', 'response_body_len',
                  'ct_srv_src', 'ct_state_ttl', 'ct_dst_ltm', 'ct_src_dport_ltm', 'ct_dst_sport_ltm',
                  'ct_dst_src_ltm', 'ct_ftp_cmd', 'ct_flw_http_mthd', 'ct_src_ltm', 'ct_srv_dst', 'is_sm_ips_ports']
categorical_cols = ['proto', 'service', 'state']

# Apply preprocessing to numerical columns
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),  # Replace missing values with median
    ('scaler', StandardScaler())  # Standardize the numerical features
])

# Apply one-hot encoding to categorical columns
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Replace missing values with the most frequent value
    ('onehot', OneHotEncoder(handle_unknown='ignore'))  # One-hot encode categorical features
])

# Combine preprocessing steps for numerical and categorical columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)])

# Fit the preprocessing step first so that the encoded feature count is known
X_preprocessed = preprocessor.fit_transform(X)

# Number of features to keep, capped at the column count after one-hot encoding
num_features_to_select = 10  # Adjust this based on your needs
num_features_to_select = min(num_features_to_select, X_preprocessed.shape[1])

# Score each preprocessed feature against the target and keep the top k
selector = SelectKBest(score_func=f_classif, k=num_features_to_select)
selector.fit(X_preprocessed, y)

# Get the mask of selected features
selected_feature_indices = selector.get_support()

# Get the names of the selected features
# (get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2)
onehot = preprocessor.named_transformers_['cat'].named_steps['onehot']
all_feature_names = numerical_cols + list(onehot.get_feature_names_out(categorical_cols))
selected_features = [name for name, is_selected in zip(all_feature_names, selected_feature_indices) if is_selected]

print("Selected Features:")
for feature in selected_features:
    print(feature)


Selected Features:
rate
sttl
dload
swin
dmean
ct_state_ttl
ct_dst_sport_ltm
proto_tcp
state_CON
state_INT




# K-NEAREST NEIGHBORS

Importing Necessary Libraries:

- The code begins by importing the required libraries: train_test_split for splitting the dataset, KNeighborsClassifier for creating the KNN model, and accuracy_score for evaluating model accuracy.

Splitting the Data:

- It uses train_test_split to split the preprocessed dataset, restricted to the selected features, into training and testing sets.
- The test_size parameter specifies the proportion of data to be used for testing (in this case, 20%).
- The random_state parameter is set for reproducibility.

Creating the KNN Model:

- It initializes a KNN classifier by specifying the number of neighbors (k) to consider when making predictions.
- You can adjust the value of k based on your needs. A larger k smooths the decision boundary and makes predictions less sensitive to noise, while a smaller k follows local structure more closely but is more prone to overfitting.

Fitting the KNN Model:

- The KNN model is fitted on the training data (X_train and y_train). For KNN, fitting mainly stores the training samples in a searchable structure; the neighbor comparisons happen at prediction time.

Making Predictions:

- After fitting, the model is used to make predictions on the test set (X_test).
- The predictions are stored in the y_pred variable.

Evaluating Model Accuracy:

- It calculates the accuracy of the KNN model using the accuracy_score function.
- The accuracy score measures how well the model's predictions match the actual target values.
- The result is printed to the console to show the accuracy of the KNN model on the test data.

In summary, this code splits a preprocessed dataset into training and testing sets, creates and fits a KNN classifier, uses it to make predictions, and evaluates the model's accuracy.
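One caveat about this workflow: the preprocessor and the selector in the previous section are fitted on the full dataset before the split, so the test rows influence the scaling statistics and the feature scores. A minimal sketch of an alternative, assuming a made-up toy DataFrame in place of the real CSV, is to split first and chain all steps in a single Pipeline so that everything is fitted on the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the real dataset
rng = np.random.default_rng(42)
n_rows = 400
toy_X = pd.DataFrame({
    "dur": rng.exponential(1.0, n_rows),
    "rate": rng.normal(0.0, 1.0, n_rows),
    "proto": rng.choice(["tcp", "udp"], n_rows),
})
toy_y = (toy_X["rate"] + 0.5 * (toy_X["proto"] == "tcp") > 0.25).astype(int)

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["dur", "rate"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["proto"]),
])

# Preprocessing, feature selection, and the classifier are fitted together
knn_pipeline = Pipeline(steps=[
    ("preprocess", toy_preprocessor),
    ("select", SelectKBest(score_func=f_classif, k=2)),
    ("knn", KNeighborsClassifier(n_neighbors=10)),
])

# Split the raw data first; the pipeline then only ever sees training rows in fit
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)
knn_pipeline.fit(toy_X_train, toy_y_train)
toy_accuracy = knn_pipeline.score(toy_X_test, toy_y_test)
print(f"Toy pipeline accuracy: {toy_accuracy:.2f}")
```

The same pipeline object can be passed to cross-validation utilities, which refit every step inside each fold.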

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score


In [5]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_preprocessed[:, selected_feature_indices], y, test_size=0.2, random_state=42)


In [6]:
# Create and fit the KNN model
k = 10  # You can adjust the number of neighbors (k) as needed
knn_model = KNeighborsClassifier(n_neighbors=k)
knn_model.fit(X_train, y_train)


KNeighborsClassifier(n_neighbors=10)
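
The value k = 10 is a free choice. One common way to pick it, sketched here on synthetic data from make_classification (an assumption for illustration, not the notebook's dataset), is to compare a few candidate values by cross-validated accuracy. The candidate list is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the preprocessed training data
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)

# Mean 5-fold cross-validated accuracy for each candidate k
cv_scores = {}
for k_candidate in (1, 5, 10, 20):
    candidate_model = KNeighborsClassifier(n_neighbors=k_candidate)
    fold_scores = cross_val_score(candidate_model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[k_candidate] = fold_scores.mean()

best_k = max(cv_scores, key=cv_scores.get)
print(cv_scores)
print("Best k:", best_k)
```

On real data the comparison should be run on the training split only, so that the test set stays untouched for the final evaluation.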

In [7]:
# Make predictions on the test set
y_pred = knn_model.predict(X_test)


In [8]:
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of KNN model: {accuracy:.2f}")

Accuracy of KNN model: 0.93


Accuracy: Accuracy is a common evaluation metric for classification models. It measures the proportion of correctly classified instances (or data points) out of the total number of instances in the test dataset.

0.93: The accuracy score is a number between 0 and 1. Here it means that the KNN model correctly classified approximately 93% of the instances in the test dataset.

The accuracy score is calculated as:

Accuracy = (Number of Correct Predictions) / (Total Number of Predictions)
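
As a small worked example of this formula, the sketch below counts correct predictions by hand and compares the result with accuracy_score. The two label lists are made up for illustration.

```python
from sklearn.metrics import accuracy_score

# Made-up labels: 10 predictions, 2 of them wrong (positions 3 and 6)
y_true_demo = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
y_pred_demo = [1, 0, 1, 0, 0, 1, 1, 1, 1, 1]

# Accuracy = correct predictions / total predictions
n_correct = sum(t == p for t, p in zip(y_true_demo, y_pred_demo))
manual_accuracy = n_correct / len(y_true_demo)

library_accuracy = accuracy_score(y_true_demo, y_pred_demo)
print(n_correct, manual_accuracy, library_accuracy)
```

Both values come out as 8 / 10 = 0.8.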

A higher accuracy score means that more test instances were classified correctly. Whether 0.93 counts as good depends on the class balance. The classification reports later in this notebook, which use the same test_size and random_state, show that about 68% of the 35,069 test instances belong to class 1, so a model that always predicted class 1 would already score about 0.68. An accuracy of 0.93 is well above that baseline.

However, it's essential to consider the specific problem and dataset when interpreting accuracy. In some cases, other metrics such as precision, recall, or F1-score may also be important, especially when dealing with imbalanced datasets or specific requirements of the problem.
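
The point about imbalanced datasets can be made concrete with a made-up example: a model that always predicts the majority class reaches high accuracy while never finding a single positive instance. The labels below are invented for illustration.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Invented labels: 95 negatives, 5 positives, and a model that always predicts 0
y_true_imb = [0] * 95 + [1] * 5
y_pred_imb = [0] * 100

acc_imb = accuracy_score(y_true_imb, y_pred_imb)
# zero_division=0 returns 0 instead of warning when no positive is predicted
prec_imb = precision_score(y_true_imb, y_pred_imb, zero_division=0)
rec_imb = recall_score(y_true_imb, y_pred_imb, zero_division=0)
f1_imb = f1_score(y_true_imb, y_pred_imb, zero_division=0)

print(f"accuracy={acc_imb:.2f} precision={prec_imb:.2f} recall={rec_imb:.2f} f1={f1_imb:.2f}")
```

Accuracy is 0.95 here, yet precision, recall, and F1 for the positive class are all 0, which is why the later sections print a full classification report.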

# FEATURE SELECTION FOR RANDOM FOREST

In [9]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Load your dataset (assuming it's in a CSV file)
data = pd.read_csv('C:/Users/shaik/Downloads/USNW (1).csv')

# Define the target variable ('label')
target = data['label']

# Define the features (exclude 'id', 'attack_cat', and 'label' columns)
X = data.drop(['id', 'attack_cat', 'label'], axis=1)

# Define categorical columns (if any)
categorical_cols = ['proto', 'service', 'state']

# Label encode categorical columns
label_encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col])
    label_encoders[col] = le

# Initialize a Random Forest classifier
random_forest_model = RandomForestClassifier(random_state=42)

# Fit the model on the entire dataset
random_forest_model.fit(X, target)

# Get feature importances
feature_importances = random_forest_model.feature_importances_

# Create a DataFrame to store feature names and their importances
feature_importance_df = pd.DataFrame({'Feature': X.columns, 'Importance': feature_importances})

# Sort features by importance in descending order
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Print the top features based on importance
top_n = 10  # Change this to the desired number of top features
selected_features = feature_importance_df['Feature'][:top_n].tolist()
print("Selected Features:")
for feature in selected_features:
    # Columns were label encoded, not one-hot encoded, so names are printed unchanged
    print(feature)


Selected Features:
sttl
ct_state_ttl
dload
rate
sload
dttl
dmean
ackdat
smean
ct_srv_dst


# RANDOM FOREST MODELLING

In [10]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load your dataset (assuming it's in a CSV file)
data = pd.read_csv('C:/Users/shaik/Downloads/USNW (1).csv')

# Define the target variable ('label')
target = data['label']

# Define the selected features (based on importance)
selected_features = [
    'sttl',
    'ct_state_ttl',
    'dload',
    'rate',
    'sload',
    'dttl',
    'dmean',
    'ackdat',
    'smean',
    'ct_srv_dst'
]

# Extract the selected features from the dataset
X = data[selected_features]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.2, random_state=42)

# Initialize a Random Forest classifier
random_forest_model = RandomForestClassifier(random_state=42)

# Fit the model on the training data
random_forest_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = random_forest_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Random Forest Model Performance:")
print("Accuracy:", accuracy)
print("Classification Report:\n", report)


Random Forest Model Performance:
Accuracy: 0.9523225640879409
Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.90      0.92     11169
           1       0.96      0.98      0.97     23900

    accuracy                           0.95     35069
   macro avg       0.95      0.94      0.94     35069
weighted avg       0.95      0.95      0.95     35069



# FEATURE SELECTION FOR DECISION TREE

In [11]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectKBest, f_classif

# Load your dataset (assuming it's in a CSV file)
data = pd.read_csv('C:/Users/shaik/Downloads/USNW (1).csv')

# Define the target variable ('label')
target = data['label']

# Define the features (exclude 'id', 'attack_cat', and 'label' columns)
X = data.drop(['id', 'attack_cat', 'label'], axis=1)

# Identify categorical columns
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()

# Apply label encoding to categorical columns (or you can use one-hot encoding)
label_encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col])
    label_encoders[col] = le

# Apply feature selection (SelectKBest with f_classif score)
num_features_to_select = 10  # Adjust the number of features to select
selector = SelectKBest(score_func=f_classif, k=num_features_to_select)
X_selected = selector.fit_transform(X, target)

# Get the names of the selected features
selected_feature_indices = selector.get_support(indices=True)
selected_features = X.columns[selected_feature_indices].tolist()

# Display the selected feature names
print("Selected Features:", selected_features)


Selected Features: ['state', 'rate', 'sttl', 'dload', 'swin', 'dwin', 'dmean', 'ct_state_ttl', 'ct_src_dport_ltm', 'ct_dst_sport_ltm']


# DECISION TREE MODELLING

In [13]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_selected, target, test_size=0.2, random_state=42)

# Initialize a Decision Tree classifier
decision_tree_model = DecisionTreeClassifier(random_state=42)

# Fit the model on the training data
decision_tree_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = decision_tree_model.predict(X_test)

# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Decision Tree Model Performance:")
print("Accuracy:", accuracy)
print("Classification Report:\n", report)


Decision Tree Model Performance:
Accuracy: 0.9258889617610996
Classification Report:
               precision    recall  f1-score   support

           0       0.89      0.87      0.88     11169
           1       0.94      0.95      0.95     23900

    accuracy                           0.93     35069
   macro avg       0.92      0.91      0.91     35069
weighted avg       0.93      0.93      0.93     35069

