An end-to-end machine learning system for detecting phishing websites using network security data. Built with production-grade MLOps practices, including automated training pipelines, experiment tracking, and a real-time inference API.
- Production-Ready ML Pipeline: Modular architecture with data ingestion, validation, transformation, and training components
- Model Performance: 97.6% F1-score on test data with ensemble learning (XGBoost, Random Forest, Gradient Boosting)
- MLOps Integration: Experiment tracking with MLflow, model versioning, and automated retraining capabilities
- RESTful API: FastAPI-based inference service with Swagger documentation
- Data Quality Assurance: Automated data validation and drift detection using statistical tests
- Scalable Design: Configuration-driven architecture supporting multiple environments
| Metric | Train | Test |
|---|---|---|
| F1 Score | 0.991 | 0.976 |
| Precision | 0.987 | 0.966 |
| Recall | 0.994 | 0.985 |
```
MongoDB → Data Ingestion → Data Validation → Feature Engineering → Model Training → Model Registry
   ↓            ↓                 ↓                   ↓                   ↓                ↓
Raw Data    CSV Export    Schema/Drift Check    Preprocessing       GridSearchCV    MLflow Tracking
```
```
API Request → File Upload → Data Preprocessing → Model Prediction → JSON Response
     ↓             ↓                ↓                    ↓                ↓
  FastAPI      CSV/Excel    Saved Preprocessor     Trained Model     Predictions
```
- Data Ingestion: Automated data extraction from MongoDB with connection pooling
- Data Validation:
- Schema validation (31 numerical features)
- Column presence checks
- Data drift detection using Kolmogorov-Smirnov test
- Automated drift report generation (see the drift-check sketch after this list)
- Data Transformation:
- Feature scaling using StandardScaler
- Robust preprocessing pipeline
- Saved transformers for inference consistency
- Model Training:
- Comparison of 7 ML algorithms (Logistic Regression, KNN, Decision Tree, Random Forest, AdaBoost, Gradient Boosting, XGBoost)
- Automated hyperparameter tuning with GridSearchCV
- Best model selected by F1-score (see the model-selection sketch after this list)
- Model serialization with pickle
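A minimal sketch of the Kolmogorov-Smirnov drift check described above, assuming plain DataFrames and a 0.05 significance threshold; the function name and report layout are illustrative:

```python
import pandas as pd
from scipy.stats import ks_2samp

def detect_drift(base_df: pd.DataFrame, current_df: pd.DataFrame,
                 threshold: float = 0.05) -> dict:
    """Run a two-sample KS test per column; p-value < threshold flags drift."""
    report = {}
    for column in base_df.columns:
        _, p_value = ks_2samp(base_df[column], current_df[column])
        report[column] = {
            "p_value": float(p_value),
            "drift_detected": bool(p_value < threshold),
        }
    return report
```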
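And a condensed sketch of the GridSearchCV comparison loop, with a hypothetical two-model candidate dictionary standing in for the full seven-algorithm setup:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV

def select_best_model(X_train, y_train, X_test, y_test):
    # Hypothetical subset of candidates; the project compares 7 algorithms.
    candidates = {
        "logistic_regression": (LogisticRegression(max_iter=1000),
                                {"C": [0.1, 1.0, 10.0]}),
        "random_forest": (RandomForestClassifier(),
                          {"n_estimators": [100, 200], "max_depth": [None, 10]}),
    }
    best = (None, None, -1.0)  # (name, fitted model, test F1)
    for name, (estimator, grid) in candidates.items():
        search = GridSearchCV(estimator, grid, scoring="f1", cv=3)
        search.fit(X_train, y_train)
        test_f1 = f1_score(y_test, search.predict(X_test))
        if test_f1 > best[2]:
            best = (name, search.best_estimator_, test_f1)
    return best
```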
- MLflow Integration:
- Experiment tracking with DagHub (see the logging sketch after this list)
- Model versioning and registry
- Hyperparameter logging
- Metric visualization
- Artifact Management:
- Timestamped artifact directories
- Model checkpointing
- Preprocessor versioning
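A minimal sketch of how a run might be logged to the DagHub-hosted tracking server; the URI matches the experiments link below, while the metric names and function signature are assumptions:

```python
import mlflow
import mlflow.sklearn

def log_run(model, train_f1: float, test_f1: float) -> None:
    """Log one training run to the remote MLflow server on DagHub."""
    mlflow.set_tracking_uri(
        "https://dagshub.com/pycoder49/networkSecuritySystem.mlflow"
    )
    with mlflow.start_run():
        mlflow.log_params(model.get_params())     # hyperparameters
        mlflow.log_metric("train_f1", train_f1)   # training F1-score
        mlflow.log_metric("test_f1", test_f1)     # test F1-score
        mlflow.sklearn.log_model(model, "model")  # versioned model artifact
```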
- FastAPI Implementation:
- RESTful endpoints for training and prediction (see the endpoint sketch after this list)
- File upload support (CSV/Excel)
- CORS middleware for cross-origin requests
- Automatic API documentation (Swagger/ReDoc)
- Model Serving:
- Real-time predictions
- Batch inference support
- HTML table rendering for results
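A stripped-down sketch of the prediction endpoint; the route and upload flow mirror the usage examples further below, while the artifact file names are assumptions:

```python
import pickle

import pandas as pd
from fastapi import FastAPI, UploadFile

app = FastAPI()

@app.post("/predict")
async def predict(file: UploadFile):
    # Read the uploaded CSV into a DataFrame.
    df = pd.read_csv(file.file)
    # Hypothetical artifact paths under final_model/.
    with open("final_model/preprocessor.pkl", "rb") as f:
        preprocessor = pickle.load(f)
    with open("final_model/model.pkl", "rb") as f:
        model = pickle.load(f)
    predictions = model.predict(preprocessor.transform(df))
    return {"predictions": predictions.tolist()}
```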
- Custom exception handling throughout pipeline
- Comprehensive logging with timestamps
- Detailed error messages with file and line numbers (see the exception sketch below)
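A sketch of the custom-exception pattern, pulling file name and line number from the active traceback; the class name mirrors exceptions/exception.py, but the exact implementation is assumed:

```python
import sys

class NetworkSecurityException(Exception):
    """Attach file name and line number to any caught exception."""

    def __init__(self, error: Exception):
        super().__init__(str(error))
        _, _, tb = sys.exc_info()  # traceback of the exception being handled
        self.file_name = tb.tb_frame.f_code.co_filename if tb else "<unknown>"
        self.line_number = tb.tb_lineno if tb else -1

    def __str__(self) -> str:
        return (f"Error in [{self.file_name}] at line "
                f"[{self.line_number}]: {self.args[0]}")

# Usage inside any pipeline component:
# try:
#     risky_operation()
# except Exception as e:
#     raise NetworkSecurityException(e) from e
```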
```
networkSecuritySystem/
├── network_security/
│   ├── components/
│   │   ├── data_ingestion.py          # MongoDB data extraction
│   │   ├── data_validation.py         # Schema & drift validation
│   │   ├── data_transformation.py     # Feature engineering
│   │   └── model_trainer.py           # Model training & evaluation
│   ├── entity/
│   │   ├── config_entity.py           # Configuration dataclasses
│   │   └── artifact_entity.py         # Pipeline artifact definitions
│   ├── constants/
│   │   └── training_pipeline.py       # Pipeline constants & configs
│   ├── utils/
│   │   ├── main_utils/
│   │   │   └── utils.py               # Helper functions (save/load, GridSearchCV)
│   │   └── ml_utils/
│   │       ├── model/estimator.py     # NetworkModel wrapper class
│   │       └── metric/                # Evaluation metrics
│   ├── exceptions/
│   │   └── exception.py               # Custom exception classes
│   └── logging/
│       └── logger.py                  # Logging configuration
├── data_schema/
│   └── schema.yaml                    # Data schema definition (31 features)
├── app.py                             # FastAPI application
├── main.py                            # Training pipeline orchestration
├── requirements.txt                   # Python dependencies
└── README.md
```
- Python 3.12: Primary language
- Pandas & NumPy: Data manipulation
- Scikit-learn: ML algorithms, preprocessing, metrics
- XGBoost: Gradient boosting framework
- SciPy: Statistical tests for drift detection
- MLflow: Experiment tracking, model registry
- DagHub: Remote MLflow server
- Pickle/Dill: Model serialization
- MongoDB: Data storage
- PyMongo: MongoDB driver
- Certifi: SSL certificate verification
- FastAPI: Web framework
- Uvicorn: ASGI server
- Jinja2: Template rendering
- Python-dotenv: Environment management
```
# Run the complete training pipeline
python main.py
```

This will:
- Ingest data from MongoDB
- Validate data quality and detect drift
- Transform features and create the preprocessor
- Train multiple models with hyperparameter tuning
- Log experiments to MLflow
- Save the best model to final_model/
```
curl -X POST "http://localhost:8000/predict" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@test.csv"
```

```python
import requests

url = "http://localhost:8000/predict"
files = {"file": open("test.csv", "rb")}
response = requests.post(url, files=files)
print(response.json())
```

```
curl -X GET "http://localhost:8000/train"
```

The system expects 31 numerical features related to network security:
| Feature | Type | Description |
|---|---|---|
| having_IP_Address | int64 | IP address present in URL |
| URL_Length | int64 | Length of URL |
| Shortining_Service | int64 | URL shortening service used |
| having_At_Symbol | int64 | '@' symbol present |
| double_slash_redirecting | int64 | '//' after protocol |
| ... | ... | ... (31 total features) |
| Result | int64 | Target variable (0: Safe, 1: Phishing) |
Full schema: data_schema/schema.yaml
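A small sketch of a column-presence check against the schema file; the top-level "columns" key is an assumed layout for schema.yaml:

```python
import pandas as pd
import yaml

def validate_columns(df: pd.DataFrame,
                     schema_path: str = "data_schema/schema.yaml") -> bool:
    """Return True if every column declared in the schema is in the DataFrame."""
    with open(schema_path) as f:
        schema = yaml.safe_load(f)
    # Assumed layout: a top-level "columns" list of {name: dtype} mappings.
    expected = [list(col)[0] for col in schema["columns"]]
    missing = [c for c in expected if c not in df.columns]
    return not missing
```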
Located in network_security/constants/training_pipeline.py:
```python
# Data Ingestion
DATA_INGESTION_COLLECTION_NAME = "NetworkData"
DATA_INGESTION_DATABASE_NAME = "aryan"
DATA_INGESTION_TRAIN_TEST_SPLIT_RATIO = 0.2

# Model Training
MODEL_TRAINER_EXPECTED_SCORE = 0.6
MODEL_TRAINER_OVERFITTING_UNDERFITTING_THRESHOLD = 0.05
```

Configure hyperparameter grids in model_trainer.py for each algorithm (a hypothetical layout is sketched after this list):
- Logistic Regression: penalty, C, solver, max_iter
- KNN: n_neighbors, weights, algorithm
- Random Forest: n_estimators, max_depth, criterion
- XGBoost: learning_rate, max_depth, n_estimators, subsample
- And more...
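The grids might be laid out as a dictionary keyed by model name; the values below are hypothetical placeholders, not the tuned ones:

```python
# Hypothetical grid layout; the real values live in model_trainer.py.
param_grids = {
    "logistic_regression": {
        "penalty": ["l2"],
        "C": [0.1, 1.0, 10.0],
        "solver": ["lbfgs"],
        "max_iter": [500, 1000],
    },
    "knn": {
        "n_neighbors": [3, 5, 7],
        "weights": ["uniform", "distance"],
        "algorithm": ["auto"],
    },
    "random_forest": {
        "n_estimators": [100, 200],
        "max_depth": [None, 10, 20],
        "criterion": ["gini", "entropy"],
    },
    "xgboost": {
        "learning_rate": [0.01, 0.1],
        "max_depth": [3, 6],
        "n_estimators": [100, 300],
        "subsample": [0.8, 1.0],
    },
}
```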
View experiments at: https://dagshub.com/pycoder49/networkSecuritySystem.mlflow
Each run logs:
- Training & test F1-scores
- Precision & Recall
- Model parameters
- Training artifacts
- Add CI/CD pipeline with GitHub Actions
- Implement real-time data streaming with Kafka
- Add model monitoring and alerting
- Containerize with Docker
- Deploy on Kubernetes
- Add A/B testing framework
- Implement model explainability (SHAP, LIME)
- Create web dashboard for predictions
- Add automated retraining on data drift detection
- Implement feature store for better feature management
Aryan Ahuja
- Email: aryan-a@outlook.com
- GitHub: @pycoder49
- DagHub: pycoder49/networkSecuritySystem