This directory contains Jupyter notebooks for exploring, preprocessing, and modeling a dataset of SSH attack sessions.
## Files

- `README.md`: This file, explaining the purpose and contents of the notebooks.
- `section0_data_preprocessing_and_cleaning.ipynb`: Data decoding and sampling.
- `section1_data_exploration_and_preprocessing.ipynb`: Data exploration and preprocessing.
- `section2_supervised_learning_classification.ipynb`: Supervised learning for intent classification.
- `section3_unsupervised_learning_clustering.ipynb`: Clustering SSH attack sessions.
- `section4_language_model_exploration.ipynb`: Exploring advanced NLP techniques for the dataset.
## Key Tasks

### section0_data_preprocessing_and_cleaning.ipynb

- Decode Base64-encoded SSH sessions.
- Tokenize sessions based on a predefined vocabulary.
- Save the processed dataset for further use.
### section1_data_exploration_and_preprocessing.ipynb

- Visualize dataset distributions.
- Clean and preprocess data for modeling.
- Perform temporal analysis and session analysis (a minimal sketch follows this list).
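A minimal sketch of the temporal-analysis step, assuming the dataset loads into a pandas DataFrame with a `timestamp` column; the toy data and column name here are stand-ins for the notebook's real loading code:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy stand-in for the attack-session DataFrame; the real dataset and
# its column names live in the notebook.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2023-01-01 02:15", "2023-01-01 14:30", "2023-01-02 03:45",
        "2023-02-10 03:50", "2023-02-10 04:10",
    ])
})

# Hourly, daily, and monthly attack frequencies.
hourly = df["timestamp"].dt.hour.value_counts().sort_index()
daily = df.set_index("timestamp").resample("D").size()
monthly = df["timestamp"].dt.to_period("M").value_counts().sort_index()

daily.plot(title="Attack sessions per day")
plt.show()
```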
### section2_supervised_learning_classification.ipynb

- Train classifiers for intent prediction.
- Evaluate performance using metrics such as accuracy and F1-score (see the sketch after this list).
- Perform hyperparameter tuning and feature experimentation.
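A hedged end-to-end sketch of that workflow with scikit-learn, using toy sessions and intent labels in place of the real dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the tokenized sessions and their intent labels.
sessions = ["wget x.sh chmod +x", "cat /etc/passwd",
            "wget y.sh chmod +x", "cat /etc/shadow"] * 10
intents = ["malware-download", "recon", "malware-download", "recon"] * 10

X_train, X_test, y_train, y_test = train_test_split(
    sessions, intents, test_size=0.25, random_state=42, stratify=intents
)

# Vectorize and classify in one pipeline, then score on the held-out set.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
print("macro F1:", f1_score(y_test, pred, average="macro"))
```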
### section3_unsupervised_learning_clustering.ipynb

- Cluster sessions to uncover hidden patterns.
- Use algorithms such as k-means and Gaussian Mixture Models (sketched after this list).
- Evaluate clusters using methods like the Elbow Method and Silhouette Analysis.
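A small sketch of fitting both algorithms with scikit-learn; the synthetic vectors stand in for whatever feature representation the notebook derives from the sessions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Synthetic feature vectors with two planted groups, standing in for
# vectorized SSH sessions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 8)), rng.normal(4, 1, (50, 8))])

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
gmm_labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X)

# Both should recover the two planted groups (cluster IDs may be swapped).
print(np.bincount(kmeans_labels), np.bincount(gmm_labels))
```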
### section4_language_model_exploration.ipynb

- Experiment with models such as BERT and Doc2Vec (a Doc2Vec sketch follows this list).
- Fine-tune pre-trained models for intent classification.
- Plot learning curves and evaluate model performance.
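As one concrete example, a minimal Doc2Vec sketch with gensim; the two-document corpus and the hyperparameters are illustrative, not the notebook's actual settings:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus of tokenized SSH sessions.
corpus = [
    TaggedDocument(words=["wget", "x.sh", "chmod", "+x"], tags=[0]),
    TaggedDocument(words=["cat", "/etc/passwd"], tags=[1]),
]

model = Doc2Vec(vector_size=32, min_count=1, epochs=40)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# Embed an unseen session; the vector can feed a downstream classifier.
vec = model.infer_vector(["wget", "y.sh", "chmod"])
print(vec.shape)  # (32,)
```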
## Getting Started

- Install dependencies using `requirements.txt` (e.g., `pip install -r requirements.txt`).
- Open the notebooks in Jupyter or JupyterLab.
- Follow the instructions provided within each notebook.
## Notebook Structure

### section0_data_preprocessing_and_cleaning.ipynb

- Install Dependencies: Install the necessary packages.
- Helper Functions: Define functions to decode Base64 strings and to clean and tokenize sessions (sketched after this list).
- Load and Process Data: Load the raw data, decode the sessions, clean and tokenize them, and save the processed dataset.
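A minimal sketch of what such helpers might look like; the function names, the `<unk>` fallback, and the toy vocabulary are assumptions for illustration, not the notebook's exact code:

```python
import base64
import re

def decode_session(encoded: str) -> str:
    # Decode a Base64-encoded SSH session into plain text.
    return base64.b64decode(encoded).decode("utf-8", errors="replace")

def tokenize_session(session: str, vocabulary: set) -> list:
    # Split on whitespace and map out-of-vocabulary tokens to <unk>.
    tokens = re.split(r"\s+", session.strip())
    return [t if t in vocabulary else "<unk>" for t in tokens]

# Example: decode one session and tokenize it against a toy vocabulary.
vocab = {"ls", "cat", "wget", "chmod", "rm", ";"}
raw = base64.b64encode(b"wget http://x.sh ; chmod +x x.sh").decode()
print(tokenize_session(decode_session(raw), vocab))
```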
### section1_data_exploration_and_preprocessing.ipynb

- Install Dependencies: Install the necessary packages.
- Dataset Preparation: Load and inspect the dataset; check for missing values and duplicates.
- Temporal Analysis: Analyze attack frequencies over time (hourly, daily, monthly trends).
- Session Analysis: Analyze the content and structure of attack sessions.
- Intent Distribution: Visualize the distribution of different attack intents.
- Text Representation: Represent the text data using techniques like TF-IDF (see the sketch after this list).
- Error Reporting: Report any errors encountered during preprocessing.
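A short TF-IDF sketch with scikit-learn; the `token_pattern` choice (keeping paths and flags intact as single tokens) is an assumption, not necessarily what the notebook uses:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sessions = [
    "wget http://x.sh chmod +x x.sh",
    "cat /etc/passwd",
    "uname -a cat /proc/cpuinfo",
]

# Split on whitespace so command paths and flags survive as tokens.
vectorizer = TfidfVectorizer(token_pattern=r"\S+")
X = vectorizer.fit_transform(sessions)

print(X.shape)  # (n_sessions, n_terms)
print(vectorizer.get_feature_names_out())
```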
### section2_supervised_learning_classification.ipynb

- Install Dependencies: Install the necessary packages.
- Data Splitting: Split the dataset into training and testing sets.
- Baseline Model Implementation: Implement baseline classifiers like Logistic Regression, Random Forest, and SVM.
- Hyperparameter Tuning: Tune hyperparameters using GridSearchCV (sketched after this list).
- Result Analysis: Analyze the performance of the models.
- Feature Experimentation: Experiment with different feature representations and preprocessing techniques.
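A hedged GridSearchCV sketch; the Random Forest estimator, the grid, and the synthetic data are illustrative stand-ins for the notebook's actual setup:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic features standing in for the vectorized training split.
X_train, y_train = make_classification(n_samples=200, n_features=20,
                                       random_state=0)

param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1_macro",  # macro F1 matches the evaluation metric above
    cv=5,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```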
### section3_unsupervised_learning_clustering.ipynb

- Install Dependencies: Install the necessary packages.
- Determine the Number of Clusters: Use the Elbow Method and Silhouette Analysis to find the optimal number of clusters (see the sketch after this list).
- Tune Other Hyperparameters: Tune hyperparameters for clustering algorithms.
- Visualize the Clusters: Visualize the resulting clusters.
- Cluster Analysis: Analyze the characteristics of each cluster.
- Assess Homogeneity and Intent Reflection: Assess how well the clusters reflect different attack intents.
- Associate Clusters with Specific Attack Categories: Link clusters to specific categories of attacks.
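A compact sketch of both selection methods on synthetic data with three planted clusters; inertia drives the Elbow Method, while the silhouette score picks the best k directly:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three planted clusters standing in for vectorized sessions.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(i * 5, 1, (40, 6)) for i in range(3)])

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sil = silhouette_score(X, km.labels_)
    print(f"k={k}  inertia={km.inertia_:.1f}  silhouette={sil:.3f}")
# Elbow Method: pick the k where inertia stops dropping sharply;
# Silhouette Analysis: pick the k with the highest score (here, k=3).
```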
### section4_language_model_exploration.ipynb

- Install Dependencies: Install the necessary packages.
- Model Selection: Choose between BERT and Doc2Vec for intent classification.
- Add a Dense Layer: Extend the model with a custom dense layer for classification (sketched after this list).
- Fine-Tuning: Fine-tune the model on the training dataset.
- Plot Learning Curves: Visualize training and validation loss curves to identify the optimal stopping point.
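A minimal PyTorch/transformers sketch of the dense-layer idea; `bert-base-uncased`, the intent count of 4, and the `[CLS]`-pooling choice are assumptions, and the fine-tuning loop itself is omitted:

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

class IntentClassifier(nn.Module):
    """A pre-trained BERT encoder with a custom dense layer on top."""
    def __init__(self, n_intents, model_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, n_intents)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] token embedding
        return self.head(cls)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = IntentClassifier(n_intents=4)  # assumed number of intent classes

batch = tokenizer(["wget x.sh ; chmod +x x.sh"], return_tensors="pt",
                  padding=True, truncation=True)
with torch.no_grad():
    logits = model(batch["input_ids"], batch["attention_mask"])
print(logits.shape)  # (1, n_intents)
```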