CTI ML Project

Overview

The CTI ML Project is a Python-based machine learning application designed to support the Cyber Threat Intelligence (CTI) lifecycle. It focuses on analyzing structured CTI data to classify threats (e.g., benign vs. malicious activities) using a Random Forest Classifier. The project includes data preprocessing, model training, evaluation, and prediction capabilities, aligning with CTI lifecycle stages: data collection, processing, analysis, and dissemination.

This project is intended for CTI specialists and security analysts who want to prototype ML-driven threat analysis. It is built with extensibility in mind, allowing integration with real-world CTI data sources (e.g., STIX/TAXII feeds).

Features

- Data Preprocessing: Handles structured CTI data, including IP address parsing, categorical encoding, and numerical scaling.
- Machine Learning: Uses a Random Forest Classifier to classify threats based on features like IP addresses, domains, ports, and timestamps.
- CTI Lifecycle Support: Implements data loading, preprocessing, training, and prediction to support the full CTI workflow.
- Error Handling: Includes robust logging and error management for debugging and operational reliability.
- Extensibility: Modular design allows easy integration with larger datasets or additional ML models.
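
To illustrate the timestamp and numerical-scaling features listed above, here is a minimal standard-library sketch. It is not taken from cti_ml_project.py: the helper names `timestamp_features` and `min_max_scale` are hypothetical, and the script itself may derive these features differently (for example with pandas and scikit-learn scalers).

```python
from datetime import date


def timestamp_features(value: str) -> dict:
    """Turn an ISO date string into simple numeric features (assumed scheme)."""
    parsed = date.fromisoformat(value)
    return {
        "day_of_week": parsed.weekday(),  # Monday is 0, Sunday is 6
        "day_of_year": parsed.timetuple().tm_yday,
    }


def min_max_scale(values):
    """Rescale numbers to the range [0, 1]; constant input maps to all zeros."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Values copied from the sample dataset in this README.
timestamps = ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04"]
ports = [80, 443, 8080, 22]

time_rows = [timestamp_features(t) for t in timestamps]
scaled_ports = min_max_scale(ports)
```

Scaling matters less for tree-based models such as Random Forest than for distance-based ones, so this step is mainly useful if other model types are swapped in later.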

Requirements

Python 3.8+
Required libraries:
- pandas
- numpy
- scikit-learn

You can install dependencies using pip:

pip install pandas numpy scikit-learn

Installation

Clone or download the project repository:

git clone <repository-url>
cd cti-ml-project

Install the required Python packages (this assumes the repository provides a requirements.txt file):
```
pip install -r requirements.txt
```
If no requirements.txt is present, install the dependencies manually as listed above.
Ensure the cti_ml_project.py script is in your working directory.

Usage

The project is implemented as a Python class (CTIMLProject) that can be used programmatically or via the provided example.

Example

The script includes a sample dataset and usage example. To run it:

python cti_ml_project.py

This will:

- Load a sample CTI dataset with features: IP address, domain, port, timestamp, and label (benign/malicious).
- Preprocess the data (e.g., split IP addresses into octets, encode categorical features).
- Train a Random Forest Classifier.
- Output model accuracy and a classification report.
- Predict the threat level for a new data point.
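
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in cti_ml_project.py: the column names `octet_1` to `octet_4` and `domain_code`, the use of `LabelEncoder`, and the `n_estimators` and `random_state` values are all choices made for this example, and the timestamp column is left out for brevity.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Toy rows mirroring the sample dataset's columns (timestamp omitted here).
df = pd.DataFrame({
    "ip_address": ["192.168.1.1", "10.0.0.2", "172.16.0.3", "192.168.1.4"],
    "domain": ["example.com", "malicious.com", "test.com", "evil.com"],
    "port": [80, 443, 8080, 22],
    "label": ["benign", "malicious", "benign", "malicious"],
})

# Split each dotted-quad IP address into four integer octet columns.
octets = df["ip_address"].str.split(".", expand=True).astype(int)
octets.columns = ["octet_1", "octet_2", "octet_3", "octet_4"]

# Encode the categorical domain column as integer codes.
domain_code = LabelEncoder().fit_transform(df["domain"])

# Assemble the feature matrix and train the classifier.
features = octets.assign(domain_code=domain_code, port=df["port"])
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(features, df["label"])

# Predictions on the training rows only; four rows cannot support
# a meaningful held-out evaluation.
predictions = model.predict(features)
print(predictions)
```

Integer-coding domains with `LabelEncoder` is the simplest option for a sketch; a domain not seen during fitting cannot be transformed by it, so a real pipeline would need a strategy for unknown values.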

Sample Code

from cti_ml_project import CTIMLProject

# Sample CTI data
sample_data = {
    'ip_address': ['192.168.1.1', '10.0.0.2', '172.16.0.3', '192.168.1.4'],
    'domain': ['example.com', 'malicious.com', 'test.com', 'evil.com'],
    'port': [80, 443, 8080, 22],
    'timestamp': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04'],
    'label': ['benign', 'malicious', 'benign', 'malicious']
}

# Initialize and run the project
cti_project = CTIMLProject()
results = cti_project.run_cti_lifecycle(sample_data)
print("Results:", results)

# Predict a new threat
new_threat = {
    'ip_address': '192.168.1.5',
    'domain': 'unknown.com',
    'port': 443,
    'timestamp': '2023-01-05'
}
prediction = cti_project.predict_threat(new_threat)
print("Predicted threat label:", prediction[0])

Output

The script logs key steps and outputs:

- Model accuracy and classification report.
- Predicted threat labels for new data.

Example log:

2025-04-17 12:56:44,559 - INFO - CTI ML Project initialized.
2025-04-17 12:56:44,560 - INFO - Data loaded with shape: (4, 8)
2025-04-17 12:56:44,564 - INFO - Data preprocessed successfully.
2025-04-17 12:56:44,652 - INFO - Model trained. Accuracy: 1.00
2025-04-17 12:56:44,653 - INFO - Classification Report:
              precision    recall  f1-score   support
      benign       1.00      1.00      1.00         2
   malicious       1.00      1.00      1.00         2
    accuracy                           1.00         4
   macro avg       1.00      1.00      1.00         4
weighted avg       1.00      1.00      1.00         4
2025-04-17 12:56:44,653 - INFO - Predicted threat label: benign

Notes

- Small Dataset: The included sample data (4 rows) is for demonstration only. For production, use a larger dataset (e.g., thousands of samples) from CTI sources like threat intelligence feeds.
- IP Address Handling: IP addresses are split into octets for numerical processing. For advanced use, consider feature engineering (e.g., geolocation, subnet analysis).
- Model Limitations: The Random Forest Classifier is suitable for structured data but may not handle unstructured CTI data (e.g., logs, packet captures). Explore anomaly detection or deep learning for such cases.
- Security: Ensure CTI data is handled securely (e.g., encrypted storage, access controls) as it may contain sensitive information.
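
As one possible take on the octet splitting and subnet analysis mentioned in the IP Address Handling note, the sketch below uses only Python's standard `ipaddress` module. The function name `ip_features`, the default /24 prefix, and the returned keys are hypothetical and are not part of cti_ml_project.py.

```python
import ipaddress


def ip_features(ip_string: str, prefix_length: int = 24) -> dict:
    """Derive octet and subnet features from an IPv4 address string."""
    # IPv4Address raises ValueError on malformed input, which doubles as validation.
    address = ipaddress.IPv4Address(ip_string)
    # strict=False accepts a host address and returns its enclosing network.
    network = ipaddress.ip_network(f"{ip_string}/{prefix_length}", strict=False)
    octets = list(address.packed)  # four bytes, one per octet
    return {
        "octet_1": octets[0],
        "octet_2": octets[1],
        "octet_3": octets[2],
        "octet_4": octets[3],
        "is_private": address.is_private,
        "subnet": str(network),
    }


features = ip_features("192.168.1.5")
print(features)
```

Grouping addresses by the `subnet` value is one simple way to let a model treat neighbouring hosts as related, which raw octets do not express directly.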

Future Improvements

- Integrate with CTI data sources (e.g., STIX/TAXII, MISP).
- Support additional ML models (e.g., Isolation Forest for anomaly detection, neural networks for unstructured data).
- Add visualization for threat analysis (e.g., matplotlib plots of feature importance).
- Implement real-time prediction APIs for integration with security operations centers (SOCs).
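
As a starting point for the feature-importance idea above, the following text-only sketch ranks features by a trained Random Forest's `feature_importances_` attribute; plotting with matplotlib is left out. The data is synthetic and generated purely for illustration (the label is made to depend on the last column only), and the feature names are assumptions, not the columns cti_ml_project.py actually produces.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names for illustration.
feature_names = ["octet_1", "octet_2", "octet_3", "octet_4", "port"]

# Synthetic stand-in data: 200 rows of integers in [0, 255].
rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(200, 5))
# Synthetic label that depends only on the last column, by construction.
y = np.where(X[:, 4] > 127, "malicious", "benign")

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Pair each feature with its importance and sort from most to least important.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

The same `ranked` list could later be passed to a bar chart once a plotting dependency is added.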

Contact

For questions or contributions, please contact the project maintainer or open an issue on the repository.
