LINDEF is a machine learning-based network intrusion detection project designed to identify suspicious network traffic while staying lightweight enough for local environments, school networks, small businesses, and other organizations with limited cybersecurity resources.
The system uses a two-stage classification pipeline:
- A binary detection model classifies traffic as benign or malicious.
- An attack classification model identifies the likely attack type when suspicious traffic is detected.
This repository contains the training code, Colab notebook, Streamlit dashboard, benchmark results, documentation, and simulation dashboard demo for the LINDEF project.
Traditional intrusion detection systems can be expensive, difficult to maintain, resource-heavy, or dependent on constantly updated rules and signatures. LINDEF explores whether a lightweight machine learning system can provide strong intrusion detection performance while tracking practical deployment factors such as:
- Accuracy
- Precision
- Recall
- F1-score
- False positive rate
- Average latency
- RAM usage
- Model size
The goal is not only to detect attacks accurately, but also to evaluate whether the system could realistically operate in lower-resource environments.
LINDEF was designed around three main goals:
- Detect malicious traffic accurately
- Classify the type of attack when possible
- Remain lightweight enough for practical local use
The project also includes response mapping, where predicted attack types are assigned severity levels and recommended containment actions.
LINDEF uses a two-stage detection process.
The binary model classifies each network flow as:
BENIGN
ATTACK
This model acts as the first detection layer. If a flow is classified as BENIGN, the system allows it. If a flow is classified as ATTACK, it is passed to the attack classification model.
The attack classification model predicts the likely attack type for malicious traffic. Example attack labels include:
neptune
smurf
nmap
portsweep
ipsweep
guess_passwd
httptunnel
warezmaster
apache2
The predicted attack type is then mapped to a severity level and a recommended response.
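The two-stage flow described above can be sketched as a small cascade. This is a minimal illustration, not the project's actual inference code: the function name `two_stage_predict`, the synthetic data, and the forest sizes are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def two_stage_predict(binary_model, attack_model, X):
    """Return one label per flow: 'BENIGN' or the predicted attack type."""
    X = np.asarray(X)
    is_attack = binary_model.predict(X) == "ATTACK"
    labels = np.full(len(X), "BENIGN", dtype=object)
    if is_attack.any():
        # Only flows flagged by stage one reach the attack classifier.
        labels[is_attack] = attack_model.predict(X[is_attack])
    return labels


# Synthetic stand-in data (hypothetical; not the LINDEF feature space).
rng = np.random.default_rng(0)
benign = rng.normal(0.0, 0.3, size=(100, 4))
neptune = rng.normal(5.0, 0.3, size=(100, 4))
nmap = rng.normal(-5.0, 0.3, size=(100, 4))

X = np.vstack([benign, neptune, nmap])
y_binary = np.array(["BENIGN"] * 100 + ["ATTACK"] * 200)
y_attack = np.array(["neptune"] * 100 + ["nmap"] * 100)

binary_model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y_binary)
# Assumption: the attack model is fitted on the malicious rows only.
attack_model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X[100:], y_attack)

sample = np.array([[0.0] * 4, [5.0] * 4, [-5.0] * 4])
result = two_stage_predict(binary_model, attack_model, sample)
print(result)
```

Keeping the second model behind the first means benign traffic, which is usually the majority, only pays for one prediction.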
Example response mapping:
| Attack Category | Example Attacks | Example Response |
|---|---|---|
| DoS-style attacks | neptune, smurf, apache2 | BLOCK_IP |
| Scanning/probe attacks | nmap, portsweep, ipsweep | BLOCK_IP |
| Credential/access attempts | guess_passwd, warezmaster | THROTTLE_IP |
| Tunneling or host compromise | httptunnel, rootkit | ISOLATE_HOST |
| Normal traffic | normal, benign | ALLOW |
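One way to express this mapping in code is a plain lookup table. In the sketch below the response actions come from the table above, while the severity values, the fallback entry, and the names `RESPONSE_MAP` and `map_response` are hypothetical and chosen only for illustration.

```python
# Response actions follow the example table; severity values are hypothetical.
RESPONSE_MAP = {
    "neptune": ("HIGH", "BLOCK_IP"),
    "smurf": ("HIGH", "BLOCK_IP"),
    "apache2": ("HIGH", "BLOCK_IP"),
    "nmap": ("MEDIUM", "BLOCK_IP"),
    "portsweep": ("MEDIUM", "BLOCK_IP"),
    "ipsweep": ("MEDIUM", "BLOCK_IP"),
    "guess_passwd": ("MEDIUM", "THROTTLE_IP"),
    "warezmaster": ("MEDIUM", "THROTTLE_IP"),
    "httptunnel": ("CRITICAL", "ISOLATE_HOST"),
    "rootkit": ("CRITICAL", "ISOLATE_HOST"),
    "normal": ("NONE", "ALLOW"),
    "benign": ("NONE", "ALLOW"),
}

# Fallback for labels missing from the table (an assumption, not from the project).
DEFAULT_RESPONSE = ("MEDIUM", "THROTTLE_IP")


def map_response(attack_label):
    """Look up severity and recommended action for a predicted label."""
    key = attack_label.strip().lower()
    severity, action = RESPONSE_MAP.get(key, DEFAULT_RESPONSE)
    return {"attack": attack_label, "severity": severity, "action": action}


print(map_response("neptune"))
print(map_response("BENIGN"))
```

A conservative fallback for unknown labels avoids silently allowing traffic that the classifier flagged but the table does not cover.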
LINDEF combines multiple public intrusion detection datasets to improve attack coverage and reduce overfitting to a single dataset.
| Dataset | Description |
|---|---|
| NSL-KDD | Benchmark intrusion detection dataset containing normal traffic and multiple attack categories |
| UNSW-NB15 | Modern intrusion detection dataset containing normal traffic and nine attack categories |
| CIC-IDS | Flow-based intrusion detection dataset containing benign traffic and multiple attack types |
Combining these datasets provides broader attack coverage, more training records, and a more diverse feature space.
The training pipeline performs the following steps:
- Loads NSL-KDD, UNSW-NB15, and CIC-IDS data
- Combines all datasets into one training set
- Removes rows with invalid, missing, or infinite values, and drops duplicate records
- Drops leakage-prone columns such as IP addresses, timestamps, and flow identifiers
- Creates binary and multi-class labels
- Encodes categorical features
- Adds a packet-rate feature when flow columns are available
- Scales features using StandardScaler
- Applies SMOTE to reduce class imbalance
- Trains Random Forest models
- Evaluates detection and classification performance
- Saves model artifacts for dashboard inference
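The cleaning steps at the start of this list can be sketched with pandas. The column names (`src_ip`, `timestamp`, `total_packets`, `duration`, and so on), the `clean_flows` helper, and the packet-rate formula are assumptions for illustration; the real datasets use their own headers and the project may compute the feature differently.

```python
import numpy as np
import pandas as pd

# Hypothetical column names; the real datasets use their own headers.
LEAKAGE_COLUMNS = ["src_ip", "dst_ip", "timestamp", "flow_id"]


def clean_flows(df):
    """Drop bad rows, duplicates, and leakage-prone columns; add a packet rate."""
    # Treat infinities as missing, then drop incomplete rows.
    df = df.replace([np.inf, -np.inf], np.nan).dropna()
    df = df.drop_duplicates()
    # Remove identifiers a model could memorize instead of learning behavior.
    df = df.drop(columns=[c for c in LEAKAGE_COLUMNS if c in df.columns])
    df = df.reset_index(drop=True)
    if {"total_packets", "duration"}.issubset(df.columns):
        # Small epsilon guards against zero-duration flows.
        df["packet_rate"] = df["total_packets"] / (df["duration"] + 1e-6)
    return df


raw = pd.DataFrame({
    "src_ip": ["10.0.0.1", "10.0.0.2", "10.0.0.2", "10.0.0.3", "10.0.0.4"],
    "duration": [1.0, 2.0, 2.0, np.inf, 0.5],
    "total_packets": [10, 40, 40, 5, np.nan],
    "label": ["normal", "neptune", "neptune", "normal", "nmap"],
})
clean = clean_flows(raw)
print(clean)
```

In this toy frame, one row is dropped for an infinite duration, one for a missing packet count, and one as an exact duplicate.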
LINDEF trains two Random Forest classifiers:
| Model | Purpose |
|---|---|
| Binary Random Forest Model | Detects whether traffic is normal or malicious |
| Attack Classification Random Forest Model | Classifies the likely attack type after traffic is flagged as malicious |
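A rough sketch of fitting the two models with scikit-learn follows. It uses synthetic data, arbitrary hyperparameters, and assumes the attack classifier is trained on malicious flows only; none of these details are confirmed by the project. SMOTE is left out to keep the example dependency-light, with a comment marking where it would typically be applied.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical; not the combined LINDEF dataset).
rng = np.random.default_rng(42)
centers = {"normal": 0.0, "neptune": 4.0, "portsweep": -4.0}
X = np.vstack([rng.normal(c, 0.5, size=(150, 5)) for c in centers.values()])
y_class = np.repeat(list(centers), 150)
y_binary = np.where(y_class == "normal", "BENIGN", "ATTACK")

X_train, X_test, yb_train, yb_test, yc_train, yc_test = train_test_split(
    X, y_binary, y_class, test_size=0.25, stratify=y_class, random_state=42
)

# Fit the scaler on training data only so test statistics do not leak in.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# SMOTE (imbalanced-learn) would typically be applied here, to the
# training split only. It is omitted from this sketch.

binary_model = RandomForestClassifier(n_estimators=50, random_state=42)
binary_model.fit(X_train_s, yb_train)

# Assumption: the attack classifier only sees flows labeled as attacks.
attack_mask = yb_train == "ATTACK"
class_model = RandomForestClassifier(n_estimators=50, random_state=42)
class_model.fit(X_train_s[attack_mask], yc_train[attack_mask])

binary_acc = binary_model.score(X_test_s, yb_test)
test_mask = yb_test == "ATTACK"
class_acc = class_model.score(X_test_s[test_mask], yc_test[test_mask])
print(f"binary accuracy: {binary_acc:.3f}, attack-type accuracy: {class_acc:.3f}")
```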
Generated model artifacts include:
binary_model.pkl
class_model.pkl
scaler.pkl
feature_columns.pkl
feature_medians.npy
labelEncoder.pkl
These artifacts are not included in the repository by default because some model files may be too large for normal GitHub upload. They can be regenerated by running the training notebook.
LINDEF produced strong performance in both binary detection and attack classification.
| Model | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| Binary Classification Model | 0.9983 | 0.9983 | 0.9983 | 0.9983 |
| Attack Classification Model | 0.9415 | 0.9416 | 0.9415 | 0.9415 |
Additional performance metrics:
| Metric | Binary Model | Attack Classification Model |
|---|---|---|
| False Positive Rate | 0.21% | 0.37% |
| Average Latency | 54.18 ms | 53.79 ms |
| RAM Usage | 18.52 MB | 8.32 MB |
| ROC-AUC | 0.97 | 0.89 |
The binary model performed best, reaching 99.83% accuracy with a 0.21% false positive rate. The attack classification model reached 94.15% accuracy across a broader set of attack labels.
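For reference, the binary metrics in these tables follow from confusion-matrix counts. The sketch below uses the standard positive-class definitions with a hypothetical `binary_metrics` helper and toy labels; how the reported LINDEF figures were averaged is not stated here, so this should be read as an illustration of the formulas rather than a reproduction of the results.

```python
def binary_metrics(y_true, y_pred, positive="ATTACK"):
    """Compute accuracy, precision, recall, F1, and false positive rate."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Share of benign flows that were wrongly flagged as attacks.
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }


# Toy labels: 8 benign flows (1 misflagged) and 4 attacks (1 missed).
y_true = ["BENIGN"] * 8 + ["ATTACK"] * 4
y_pred = ["BENIGN"] * 7 + ["ATTACK"] + ["ATTACK"] * 3 + ["BENIGN"]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```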
LINDEF was compared against common intrusion detection approaches. The LINDEF metrics come from project testing, while the non-LINDEF values are literature-based comparison ranges summarized for context.
| Method | Detection Task | Accuracy | False Positive Rate | Average Latency | RAM Usage |
|---|---|---|---|---|---|
| LINDEF Binary Random Forest | Normal vs. attack detection | 99.83% | 0.21% | 54.18 ms | 18.52 MB |
| LINDEF Attack Classification Random Forest | Attack type classification | 94.15% | 0.37% | 53.79 ms | 8.32 MB |
| Signature-Based IDS | Known attack pattern matching | 94%–98% | <1% | <5 ms | 50–200 MB |
| Anomaly-Based IDS | Detects deviations from normal behavior | 85%–95% | 3%–5% | 10–50 ms | 1–4 GB |
| Cloud-Based Endpoint Detection | Endpoint and device monitoring | 96%–99% | 2% | 50–200 ms | Varies |
| Rule-Based IDS | Manually written detection rules | 90%–95% | 1%–2% | 5–15 ms | 200–500 MB |
LINDEF combines high accuracy, a low false positive rate, and low RAM usage (under 20 MB for either model). Its average latency of roughly 54 ms is higher than the ranges listed for signature-based and rule-based IDS, but the tradeoff is reasonable for the intended use case of lightweight monitoring in smaller or moderate-traffic environments.
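Latency and model size can be measured in a few lines. The sketch below times single-flow predictions and reports the pickled size of a small forest trained on synthetic data. Whether the reported LINDEF latency was measured per flow or per batch is not stated here, so the per-flow timing, the helper names, and the toy model are assumptions.

```python
import pickle
import time

import numpy as np
from sklearn.ensemble import RandomForestClassifier


def average_latency_ms(model, X, repeats=3):
    """Mean wall-clock time of a single-row prediction, in milliseconds."""
    timings = []
    for _ in range(repeats):
        for row in X:
            start = time.perf_counter()
            model.predict(row.reshape(1, -1))
            timings.append((time.perf_counter() - start) * 1000.0)
    return sum(timings) / len(timings)


def serialized_size_mb(model):
    """Size of the pickled model in megabytes."""
    return len(pickle.dumps(model)) / (1024 * 1024)


# Synthetic stand-in model (hypothetical; not a LINDEF artifact).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
y = (X[:, 0] > 0).astype(int)
model = RandomForestClassifier(n_estimators=25, random_state=1).fit(X, y)

latency_ms = average_latency_ms(model, X[:20])
size_mb = serialized_size_mb(model)
print(f"average latency: {latency_ms:.2f} ms, model size: {size_mb:.3f} MB")
```

Numbers from a sketch like this depend heavily on hardware and forest size, so they are only comparable when measured on the same machine.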
This repository includes a screen-recorded dashboard demo named:
lindef_simulation_dashboard
The demo shows the Streamlit dashboard processing sample flow data generated from the training/testing pipeline. It demonstrates the detection pipeline without requiring a full live packet capture setup.
The demo shows LINDEF:
- Loading trained model artifacts
- Processing sample flow-style records
- Aligning data to the expected training features
- Scaling input features
- Classifying traffic as benign or malicious
- Predicting attack types
- Assigning severity levels
- Recommending response actions
- Logging recent detections
- Displaying dashboard charts
This demo should be interpreted as a simulation dashboard demo using sample flow data, not as a fully deployed live-network environment.
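The feature-alignment step can be sketched with pandas. This assumes the saved feature list and medians are parallel arrays and that missing or non-finite values are filled with the training medians; the `align_features` helper and the example column names are hypothetical.

```python
import numpy as np
import pandas as pd


def align_features(df, feature_columns, feature_medians):
    """Reorder columns to the training layout and fill gaps with medians."""
    medians = pd.Series(feature_medians, index=feature_columns)
    # Drops unknown columns and adds any missing ones as NaN.
    aligned = df.reindex(columns=feature_columns)
    aligned = aligned.apply(pd.to_numeric, errors="coerce")
    aligned = aligned.replace([np.inf, -np.inf], np.nan)
    return aligned.fillna(medians)


# Hypothetical stand-ins for feature_columns.pkl and feature_medians.npy.
feature_columns = ["duration", "total_packets", "packet_rate"]
feature_medians = np.array([1.5, 20.0, 12.0])

incoming = pd.DataFrame({
    "total_packets": [10, 300],
    "duration": [0.5, np.inf],
    "extra_column": ["a", "b"],
})
aligned = align_features(incoming, feature_columns, feature_medians)
print(aligned)
```

Aligning before scaling matters because a fitted scaler and forest both expect columns in the exact order seen during training.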
The dashboard is designed to support a future live workflow using:
TShark packet capture
CICFlowMeter feature extraction
LINDEF model inference
Streamlit visualization
If the demo video is included in this repository, it should be placed in:
demo/lindef_simulation_dashboard.mp4
LINDEF/
├── README.md
├── requirements.txt
├── .gitignore
│
├── src/
│ └── train_lindef_models.py
│
├── notebooks/
│ └── LINDEF_training_colab.ipynb
│
├── app/
│ └── dashboard.py
│
├── data/
│ └── README.md
│
├── models/
│ └── README.md
│
├── results/
│ ├── README.md
│ ├── benchmark_results.csv
│ ├── benchmark_results.md
│ ├── confusion_matrices.png
│ ├── binary_roc.png
│ └── multiclass_roc.png
│
├── demo/
│ ├── README.md
│ └── lindef_simulation_dashboard.mp4
│
└── docs/
    ├── methodology.md
    ├── limitations.md
    ├── future_work.md
    └── LINDEF_poster.pdf
Clone the repository:
git clone https://github.com/YOUR-USERNAME/LINDEF.git
cd LINDEF
Install dependencies:
pip install -r requirements.txt
The easiest way to train the models is through the Colab notebook:
notebooks/LINDEF_training_colab.ipynb
The notebook generates:
binary_model.pkl
class_model.pkl
scaler.pkl
feature_columns.pkl
feature_medians.npy
labelEncoder.pkl
simulation_test.csv
confusion_matrices.png
binary_roc.png
multiclass_roc.png
The generated model files should be placed locally in:
models/
Model files are not included by default because some artifacts may be too large for normal GitHub upload.
To run the dashboard, first place the generated model artifacts in the local models/ folder.
Expected local files:
models/binary_model.pkl
models/class_model.pkl
models/scaler.pkl
models/feature_columns.pkl
models/feature_medians.npy
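A small pre-flight check can confirm that these files are in place. The sketch below assumes the dashboard reads its artifacts from `models/` relative to the working directory; the `missing_artifacts` helper is hypothetical and not part of the repository.

```python
from pathlib import Path

# File names taken from the expected local files listed above.
EXPECTED_ARTIFACTS = [
    "binary_model.pkl",
    "class_model.pkl",
    "scaler.pkl",
    "feature_columns.pkl",
    "feature_medians.npy",
]


def missing_artifacts(models_dir="models"):
    """Return the expected artifact names that are not present on disk."""
    root = Path(models_dir)
    return [name for name in EXPECTED_ARTIFACTS if not (root / name).is_file()]


missing = missing_artifacts()
if missing:
    print("Missing model artifacts:", ", ".join(missing))
else:
    print("All expected model artifacts found.")
```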
Then run:
streamlit run app/dashboard.py
The dashboard can run in simulation mode using sample flow data. A future version will expand live capture support using TShark and CICFlowMeter.
The full datasets are not included in this repository because of file size and licensing constraints.
The training pipeline expects the following files:
NSL-KDD (training) - KDDTrain+.csv
NSL-KDD (testing) - KDDTest+.csv
UNSW-NB15 (training) - UNSW_NB15_training-set (1).csv
UNSW-NB15 (testing) - UNSW_NB15_testing-set (1).csv
CIC-IDS CSV files or CIC-IDS zip file
More details are provided in:
data/README.md
A general project poster summarizing the background, methodology, results, benchmark comparison, limitations, and future work is included in:
docs/LINDEF_Project_Poster.pdf
This poster is included as a general LINDEF project summary, not only as a competition-specific poster.
LINDEF has several limitations that should be considered before real-world deployment:
- Public IDS datasets may not fully represent live enterprise traffic.
- CICFlowMeter does not directly recreate every NSL-KDD or UNSW-NB15 feature from raw packet captures.
- Live detection accuracy depends on how closely extracted features match the training feature space.
- Some attack classes have fewer samples than others.
- SMOTE helps class imbalance but does not fully replace real examples of rare attacks.
- The current dashboard demo uses sample data rather than a fully deployed live network environment.
- Benchmark values for non-LINDEF methods are literature-based comparison ranges, not direct same-hardware tests.
- The model may struggle with zero-day attacks or adversarial traffic designed to evade detection.
- Additional testing is needed in higher-bandwidth and larger network environments.
Future improvements include:
- Improving live feature extraction with CICFlowMeter or NFStream
- Testing LINDEF on larger and higher-bandwidth networks
- Expanding training data with newer intrusion detection datasets
- Adding ensemble models such as Random Forest plus XGBoost or LightGBM
- Tuning probability thresholds to reduce false positives
- Adding explainability tools such as feature importance or SHAP
- Improving containment actions based on attack type, confidence, and repeated behavior
- Adding Docker support for easier setup
- Directly benchmarking LINDEF against Snort, Suricata, Zeek, and endpoint detection tools on the same hardware
- Improving the Streamlit dashboard for more reliable live monitoring
LINDEF uses:
- Python
- pandas
- NumPy
- scikit-learn
- imbalanced-learn
- Matplotlib
- Seaborn
- Joblib
- psutil
- Streamlit
- TShark/Wireshark
- CICFlowMeter
LINDEF is a research prototype and science fair project. It demonstrates that a lightweight machine learning pipeline can achieve strong intrusion detection results while maintaining relatively low memory usage. Additional live-network testing is needed before real-world deployment.
Sanjay Balaji