A clean implementation of logistic regression from scratch using only NumPy, demonstrating the mathematical foundations of binary classification.
logistic-regression-numpy/
├── README.md                    # Project documentation
├── requirements.txt             # Required dependencies
├── .gitignore                   # Git ignore file
├── data/                        # Data directory
│   └── marks.txt                # Student exam scores dataset
├── src/                         # Source code
│   ├── __init__.py
│   ├── logistic_regression.py   # LogisticRegression class
│   ├── utils.py                 # Utility functions
│   └── visualization.py         # Plotting functions
├── examples/                    # Usage examples
│   ├── train_example.py         # Complete training script
│   └── demo.ipynb               # Interactive Jupyter notebook
├── notebooks/
│   ├── [Assignment].ipynb       # Notebook assignment
│   ├── [Solution].ipynb         # Notebook solution
│   └── demo.ipynb               # Interactive Jupyter notebook
└── tests/                       # Test suite
    ├── __init__.py
    ├── test_logistic_regression.py
    └── test_utils.py
- Pure NumPy Implementation: No external ML libraries (scikit-learn, etc.)
- Mathematical Foundation: Clear implementation of sigmoid function, cost function, and gradient descent
- Data Preprocessing: Normalization and bias term handling
- Visualization: Decision boundary and training progress plots
- Modular Design: Clean separation of concerns
- Unit Tests: Comprehensive test coverage
- Documentation: Well-documented code with mathematical explanations
Sigmoid function:
σ(z) = 1 / (1 + e^(-z))
Cost function (binary cross-entropy):
J(θ) = -(1/m) * Σ[y*log(h_θ(x)) + (1-y)*log(1-h_θ(x))]
Gradient:
∂J(θ)/∂θ = (1/m) * X^T * (h_θ(X) - y)
Update rule:
θ := θ - α * ∇J(θ)
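These four formulas translate almost line for line into NumPy. A minimal sketch (illustrative only; the actual implementation lives in src/logistic_regression.py and may differ in its details):
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # Binary cross-entropy J(theta), averaged over m examples
    m = len(y)
    h = sigmoid(X @ theta)  # h_theta(X)
    return -(1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient_descent_step(theta, X, y, alpha):
    # One update: theta := theta - alpha * (1/m) * X^T (h - y)
    m = len(y)
    grad = (1.0 / m) * X.T @ (sigmoid(X @ theta) - y)
    return theta - alpha * grad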
- Clone the repository:
git clone https://github.com/trngthnh369/logistic-regression-numpy.git
cd logistic-regression-numpy
- Install dependencies:
pip install -r requirements.txt
- Run the complete training example:
python examples/train_example.py
- Or explore interactively in Jupyter:
jupyter notebook notebooks/demo.ipynb
from src.logistic_regression import LogisticRegression
from src.utils import load_data, normalize_data, train_test_split
from src.visualization import plot_data, plot_decision_boundary
# Load and prepare data
"data = load_data('data/marks.txt')
X, y = data[:, :-1], data[:, -1:]
# Split data
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=42)
# Normalize features
train_X_norm, test_X_norm = normalize_data(train_X, test_X)
# Create and train model
model = LogisticRegression(learning_rate=0.001, epochs=5000)
model.fit(train_X_norm, train_y)
# Make predictions
predictions = model.predict(test_X_norm)
accuracy = model.accuracy(test_X_norm, test_y)
print(f"Test Accuracy: {accuracy:.2f}%")
The model achieves ~89.5% accuracy on the test set, demonstrating effective binary classification with a pure NumPy implementation.
Run the test suite:
python -m pytest tests/ -v
Run tests with coverage:
python -m pytest tests/ --cov=src --cov-report=html
- fit(): Train the model using gradient descent
- predict(): Make binary predictions
- predict_proba(): Get prediction probabilities (see the example below)
- accuracy(): Calculate classification accuracy
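For instance, predict_proba() exposes the raw sigmoid outputs before they are thresholded into labels. Continuing the usage example above (assuming the same model and data variables):
# Probabilities in (0, 1) versus hard 0/1 labels (threshold at 0.5)
probs = model.predict_proba(test_X_norm)
labels = model.predict(test_X_norm)
print(probs[:5], labels[:5])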
- Data loading and preprocessing
- Feature normalization (see the sketch after this list)
- Train-test splitting
- Data quality checks
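A minimal sketch of what z-score normalization with training-set statistics can look like (the actual helper in src/utils.py may differ):
import numpy as np

def normalize_data(train_X, test_X):
    # Compute statistics on the training split only, so the test split
    # never influences preprocessing (no data leakage).
    mean = train_X.mean(axis=0)
    std = train_X.std(axis=0)
    std[std == 0] = 1.0  # guard against constant features
    return (train_X - mean) / std, (test_X - mean) / std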
- Data scatter plots with class separation
- Decision boundary visualization (sketched after this list)
- Training progress plots
- Confusion matrices and probability distributions
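With two input features and a bias column, the learned boundary is the line θ₀ + θ₁x₁ + θ₂x₂ = 0. A minimal matplotlib sketch (illustrative only; src/visualization.py may implement this differently):
import numpy as np
import matplotlib.pyplot as plt

def plot_decision_boundary(theta, X, y):
    theta = np.asarray(theta).ravel()  # expects [bias, w1, w2]
    y = np.asarray(y).ravel()
    # Scatter the two classes using the raw feature columns
    plt.scatter(X[y == 0, 0], X[y == 0, 1], label='class 0')
    plt.scatter(X[y == 1, 0], X[y == 1, 1], label='class 1')
    # Solve theta0 + theta1*x1 + theta2*x2 = 0 for x2
    x1 = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
    x2 = -(theta[0] + theta[1] * x1) / theta[2]
    plt.plot(x1, x2, 'k--', label='decision boundary')
    plt.xlabel('feature 1')
    plt.ylabel('feature 2')
    plt.legend()
    plt.show()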
- Feature Normalization: Z-score standardization using training statistics
- Bias Term: Automatically added as intercept column
- Gradient Descent: Iterative optimization with configurable learning rate
- Sigmoid Activation: Ensures output probabilities between 0 and 1
- Cross-entropy Loss: Appropriate cost function for binary classification
- Numerical Stability: Clipping to prevent overflow/underflow
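The last two points are easy to get wrong; here is a sketch of the clipping trick and the bias column (the exact thresholds and conventions in src/ may differ):
import numpy as np

def stable_sigmoid(z):
    # Bound z before exponentiating: np.exp overflows float64 for arguments
    # above roughly 709, and sigmoid already saturates to ~0 or ~1 well
    # before |z| = 500.
    z = np.clip(z, -500, 500)
    return 1.0 / (1.0 + np.exp(-z))

def add_bias(X):
    # Prepend a column of ones so theta[0] acts as the intercept term.
    return np.hstack([np.ones((X.shape[0], 1)), X])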
This implementation is designed for learning purposes to understand:
- Mathematical foundations of logistic regression
- Gradient descent optimization
- Binary classification principles
- Feature preprocessing importance
- Model evaluation techniques
Potential extensions:
- Multi-class logistic regression (softmax)
- Regularization (L1/L2; see the sketch below)
- Different optimization algorithms
- Feature engineering utilities
- Advanced visualization options
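As an illustration of the regularization extension, L2 only adds a penalty term to the gradient. A sketch (lam is a hypothetical hyperparameter, not part of the current API):
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def l2_regularized_gradient(theta, X, y, lam):
    # Gradient of cross-entropy plus (lam / 2m) * ||theta[1:]||^2
    m = len(y)
    grad = (1.0 / m) * X.T @ (sigmoid(X @ theta) - y)
    penalty = (lam / m) * theta
    penalty[0] = 0.0  # conventionally, the bias term is not regularized
    return grad + penalty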
The model tracks and provides:
- Training and validation accuracy
- Cost function evolution
- Confusion matrix analysis (sketched below)
- Probability distribution plots
- Feature importance (weights)
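A 2×2 confusion matrix, for example, needs nothing beyond NumPy (hypothetical helper, not necessarily the repo's exact API):
import numpy as np

def confusion_matrix(y_true, y_pred):
    # Rows index the actual class, columns the predicted class (0 or 1)
    y_true = np.asarray(y_true).ravel().astype(int)
    y_pred = np.asarray(y_pred).ravel().astype(int)
    cm = np.zeros((2, 2), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)  # count each (actual, predicted) pair
    return cm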
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Add tests for new functionality
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
- Original dataset from Machine Learning course materials
- Mathematical formulations based on Andrew Ng's Machine Learning course
- NumPy community for excellent documentation
If you have questions or suggestions, feel free to:
- Open an issue on GitHub
- Contact the author
- Contribute to the project
Note: This implementation is for educational purposes to understand the mathematical foundations of logistic regression. For production use, consider optimized libraries like scikit-learn.