trngthnh369/logistic-regression-numpy

Logistic Regression with Pure NumPy

A clean implementation of logistic regression from scratch using only NumPy, demonstrating the mathematical foundations of binary classification.

📁 Project Structure

logistic-regression-numpy/
├── README.md                      # Project documentation
├── requirements.txt               # Required dependencies
├── .gitignore                    # Git ignore file
├── data/                         # Data directory
│   └── marks.txt                 # Student exam scores dataset
├── src/                          # Source code
│   ├── __init__.py
│   ├── logistic_regression.py    # LogisticRegression class
│   ├── utils.py                  # Utility functions
│   └── visualization.py          # Plotting functions
├── examples/                     # Usage examples
│   ├── train_example.py          # Complete training script
│   └── demo.ipynb                # Interactive Jupyter notebook
├── notebooks/                    # Jupyter notebooks
│   ├── [Assignment].ipynb        # Notebook assignment
│   ├── [Solution].ipynb          # Notebook solution
│   └── demo.ipynb                # Interactive Jupyter notebook
└── tests/                        # Test suite
    ├── __init__.py
    ├── test_logistic_regression.py
    └── test_utils.py

🚀 Features

  • Pure NumPy Implementation: No external ML libraries (scikit-learn, etc.)
  • Mathematical Foundation: Clear implementation of sigmoid function, cost function, and gradient descent
  • Data Preprocessing: Normalization and bias term handling
  • Visualization: Decision boundary and training progress plots
  • Modular Design: Clean separation of concerns
  • Unit Tests: Comprehensive test coverage
  • Documentation: Well-documented code with mathematical explanations

📊 Mathematical Background

Sigmoid Function

σ(z) = 1 / (1 + e^(-z))
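
Evaluating the formula directly can overflow in `exp` for large negative z. The following is a minimal sketch of a numerically stable variant; the function name `sigmoid` and this particular rewrite are assumptions for illustration, not necessarily what `src/logistic_regression.py` does.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)), evaluated without overflow."""
    z = np.asarray(z, dtype=float)
    # exp(-|z|) always lies in (0, 1], so it cannot overflow.
    e = np.exp(-np.abs(z))
    # z >= 0: 1 / (1 + e^(-z));  z < 0: e^z / (1 + e^z), the same value rewritten.
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

values = sigmoid(np.array([-1000.0, 0.0, 1000.0]))
```

The two branches are algebraically identical; choosing by the sign of z only keeps the exponent non-positive.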

Cost Function (Negative Log-Likelihood)

J(θ) = -1/m * Σ[y*log(h_θ(x)) + (1-y)*log(1-h_θ(x))]
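
As a rough illustration, the cost can be computed from predicted probabilities in a few lines of NumPy. The name `binary_cross_entropy` and the clipping constant `eps` are assumptions made for this sketch; the repository may use different names or values.

```python
import numpy as np

def binary_cross_entropy(h, y, eps=1e-15):
    """J = -1/m * sum(y*log(h) + (1-y)*log(1-h)) for probabilities h and labels y."""
    # Clip so that log() never receives exactly 0 (eps is an assumed value).
    h = np.clip(h, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h)))

y = np.array([[1.0], [0.0], [1.0], [0.0]])
uninformed = binary_cross_entropy(np.full((4, 1), 0.5), y)  # every prediction is 0.5
perfect = binary_cross_entropy(y.copy(), y)                 # predictions equal labels
```

Predicting 0.5 everywhere gives a cost of log(2), which is also the starting cost when θ is initialized to zeros.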

Gradient

∂J(θ)/∂θ = 1/m * X^T * (h_θ(X) - Y)
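
One way to gain confidence in this expression is to compare it with a finite-difference approximation of the cost. The sketch below does that on small random data; the helper names and the data are hypothetical and only serve the check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return float(-np.mean(y * np.log(h) + (1 - y) * np.log(1 - h)))

def gradient(theta, X, y):
    # 1/m * X^T (h - y), matching the formula above.
    return X.T @ (sigmoid(X @ theta) - y) / X.shape[0]

rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])  # bias column + 2 features
y = (rng.random((20, 1)) > 0.5).astype(float)
theta = rng.normal(scale=0.1, size=(3, 1))

analytic = gradient(theta, X, y)
numeric = np.zeros_like(theta)
step = 1e-6
for j in range(theta.shape[0]):
    d = np.zeros_like(theta)
    d[j] = step
    # Central difference approximation of dJ/dtheta_j.
    numeric[j] = (cost(theta + d, X, y) - cost(theta - d, X, y)) / (2 * step)

max_diff = float(np.max(np.abs(analytic - numeric)))
```

If the analytic gradient is right, `max_diff` should be tiny compared with the gradient entries themselves.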

Parameter Update

θ := θ - α * ∇J(θ)
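
Put together, the update rule becomes a short training loop. This is a sketch under stated assumptions: the function name `fit_gradient_descent`, the zero initialization, the default hyperparameters, and the toy data are all chosen for illustration and are not taken from the repository.

```python
import numpy as np

def fit_gradient_descent(X, y, learning_rate=0.1, epochs=1000):
    """Batch gradient descent for logistic regression; X already has a bias column."""
    m, n = X.shape
    theta = np.zeros((n, 1))
    history = []
    for _ in range(epochs):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))
        h_safe = np.clip(h, 1e-15, 1 - 1e-15)  # only used for the logged cost
        history.append(float(-np.mean(y * np.log(h_safe) + (1 - y) * np.log(1 - h_safe))))
        # theta := theta - alpha * grad J(theta)
        theta -= learning_rate * (X.T @ (h - y)) / m
    return theta, history

# Toy data: one feature, label 1 when the feature is positive.
x = np.linspace(-2, 2, 40).reshape(-1, 1)
X = np.hstack([np.ones_like(x), x])
y = (x > 0).astype(float)
theta, history = fit_gradient_descent(X, y)
```

Recording the cost at every epoch is what makes a training-progress plot possible later.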

🛠️ Installation

  1. Clone the repository:
git clone https://github.com/trngthnh369/logistic-regression-numpy.git
cd logistic-regression-numpy
  2. Install dependencies:
pip install -r requirements.txt

Run Complete Example

python examples/train_example.py

Interactive Demo

jupyter notebook notebooks/demo.ipynb

🎯 Quick Start

Basic Usage

from src.logistic_regression import LogisticRegression
from src.utils import load_data, normalize_data, train_test_split
from src.visualization import plot_data, plot_decision_boundary

# Load and prepare data
data = load_data('data/marks.txt')
X, y = data[:, :-1], data[:, -1:]

# Split data
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=42)

# Normalize features
train_X_norm, test_X_norm = normalize_data(train_X, test_X)

# Create and train model
model = LogisticRegression(learning_rate=0.001, epochs=5000)
model.fit(train_X_norm, train_y)

# Make predictions
predictions = model.predict(test_X_norm)
accuracy = model.accuracy(test_X_norm, test_y)

print(f"Test Accuracy: {accuracy:.2f}%")

📈 Results

The model reaches about 89.5% accuracy on the test set using only NumPy.

🧪 Testing

Run the test suite:

python -m pytest tests/ -v

Run tests with coverage:

python -m pytest tests/ --cov=src --cov-report=html

📚 Key Components

LogisticRegression Class (src/logistic_regression.py)

  • fit(): Train the model using gradient descent
  • predict(): Make binary predictions
  • predict_proba(): Get prediction probabilities
  • accuracy(): Calculate classification accuracy
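
To make the relationship between these methods concrete, here is a sketch written as standalone functions. It assumes a 0.5 decision threshold and an accuracy reported as a percentage (suggested by the `%` in the Quick Start print statement); the actual class methods may differ in signature and behavior.

```python
import numpy as np

def predict_proba(X, theta):
    """Probability of class 1 for each row of X (X includes the bias column)."""
    return 1.0 / (1.0 + np.exp(-(X @ theta)))

def predict(X, theta, threshold=0.5):
    """Binary labels obtained by thresholding the probabilities (threshold assumed)."""
    return (predict_proba(X, theta) >= threshold).astype(int)

def accuracy(X, y, theta):
    """Share of correct predictions, as a percentage (assumed convention)."""
    return float(np.mean(predict(X, theta) == y) * 100.0)

theta = np.array([[0.0], [1.0]])                      # hypothetical trained weights
X = np.array([[1.0, -2.0], [1.0, 0.5], [1.0, 3.0], [1.0, -0.1]])
y = np.array([[0], [1], [1], [1]])
labels = predict(X, theta)
score = accuracy(X, y, theta)
```

Here three of the four samples are classified correctly, so `score` is 75.0.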

Utility Functions (src/utils.py)

  • Data loading and preprocessing
  • Feature normalization
  • Train-test splitting
  • Data quality checks
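
For readers curious how a split helper can work without scikit-learn, the following sketch mirrors the call shown in Quick Start. The body is an assumption about one reasonable implementation, not the code in `src/utils.py`.

```python
import numpy as np

def train_test_split(X, y, test_size=0.2, random_state=None):
    """Shuffle row indices once, then slice them into test and train parts."""
    rng = np.random.default_rng(random_state)
    idx = rng.permutation(X.shape[0])
    n_test = int(round(X.shape[0] * test_size))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    # Same return order as in the Quick Start example.
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10).reshape(10, 1)
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=42)
```

Indexing `X` and `y` with the same permutation keeps each feature row paired with its label.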

Visualization (src/visualization.py)

  • Data scatter plots with class separation
  • Decision boundary visualization
  • Training progress plots
  • Confusion matrices and probability distributions
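
With two features, the decision boundary is the line where θ₀ + θ₁x₁ + θ₂x₂ = 0, i.e. where the predicted probability is exactly 0.5. The sketch below computes points on that line with NumPy only; `decision_boundary_line` is a hypothetical helper, and passing the result to a plotting call is left to the reader.

```python
import numpy as np

def decision_boundary_line(theta, x1_min, x1_max, num=50):
    """Points (x1, x2) satisfying theta0 + theta1*x1 + theta2*x2 = 0 (theta2 != 0)."""
    theta0, theta1, theta2 = np.asarray(theta, dtype=float).ravel()
    x1 = np.linspace(x1_min, x1_max, num)
    x2 = -(theta0 + theta1 * x1) / theta2
    return x1, x2

# Hypothetical weights: bias 0.5, feature weights 2.0 and -1.0.
x1, x2 = decision_boundary_line([0.5, 2.0, -1.0], -2.0, 2.0)
z = 0.5 + 2.0 * x1 - 1.0 * x2  # linear score along the line
```

If the features were normalized before training, the line should be drawn in the same normalized coordinates.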

🔬 Mathematical Implementation Details

  1. Feature Normalization: Z-score standardization using training statistics
  2. Bias Term: Automatically added as intercept column
  3. Gradient Descent: Iterative optimization with configurable learning rate
  4. Sigmoid Activation: Ensures output probabilities between 0 and 1
  5. Cross-entropy Loss: Appropriate cost function for binary classification
  6. Numerical Stability: Clipping to prevent overflow/underflow
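
Items 1 and 2 can be sketched as follows. The point of item 1 is that the test set is scaled with the mean and standard deviation of the training set, never its own. The function bodies, the `add_bias` name, and the zero-variance guard are assumptions for illustration.

```python
import numpy as np

def normalize_data(train_X, test_X):
    """Z-score both sets using statistics computed on the training set only."""
    mean = train_X.mean(axis=0)
    std = train_X.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # assumed guard for constant features
    return (train_X - mean) / std, (test_X - mean) / std

def add_bias(X):
    """Prepend a column of ones so theta[0] acts as the intercept."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

train_X = np.array([[30.0, 40.0], [50.0, 60.0], [70.0, 80.0]])
test_X = np.array([[50.0, 100.0]])
train_norm, test_norm = normalize_data(train_X, test_X)
train_design = add_bias(train_norm)
```

After this step the training features have mean 0 and standard deviation 1, while test values may fall outside that range.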

🎓 Educational Value

This implementation is designed for learning purposes to understand:

  • Mathematical foundations of logistic regression
  • Gradient descent optimization
  • Binary classification principles
  • Feature preprocessing importance
  • Model evaluation techniques

🧩 Extending the Project

Potential extensions:

  • Multi-class logistic regression (softmax)
  • Regularization (L1/L2)
  • Different optimization algorithms
  • Feature engineering utilities
  • Advanced visualization options
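
As an example of the second extension, L2 regularization only changes the gradient by an extra term proportional to θ. The sketch below follows one common convention (penalty λ/(2m)·Σθⱼ², bias term excluded); the function name and the parameter `lam` are hypothetical.

```python
import numpy as np

def l2_regularized_gradient(theta, X, y, lam=1.0):
    """Gradient of the cross-entropy cost plus an L2 penalty on non-bias weights."""
    m = X.shape[0]
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    grad = X.T @ (h - y) / m
    penalty = (lam / m) * theta
    penalty[0] = 0.0  # the intercept is conventionally left unregularized
    return grad + penalty

X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0], [1.0, 0.0]])
y = np.array([[1.0], [0.0], [1.0], [0.0]])
theta = np.array([[0.2], [0.8]])
plain = l2_regularized_gradient(theta, X, y, lam=0.0)
ridge = l2_regularized_gradient(theta, X, y, lam=2.0)
```

Setting `lam=0.0` recovers the unregularized gradient, which makes the extension easy to test against existing behavior.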

📊 Performance Metrics

The model tracks and provides:

  • Training and validation accuracy
  • Cost function evolution
  • Confusion matrix analysis
  • Probability distribution plots
  • Feature importance (weights)
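
A confusion matrix for the binary case can be built with NumPy alone. This is a sketch, assuming labels coded as 0 and 1 and a layout with true classes as rows and predicted classes as columns; the repository's own helper may be organized differently.

```python
import numpy as np

def confusion_matrix(y_true, y_pred):
    """2x2 count matrix: rows are true classes, columns are predicted classes."""
    y_true = np.asarray(y_true).ravel().astype(int)
    y_pred = np.asarray(y_pred).ravel().astype(int)
    cm = np.zeros((2, 2), dtype=int)
    # Unbuffered add, so repeated (true, pred) pairs are each counted.
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

cm = confusion_matrix([0, 0, 1, 1, 1], [0, 1, 1, 1, 0])
accuracy_from_cm = np.trace(cm) / cm.sum()
```

The diagonal holds the correct predictions, so accuracy is the trace divided by the total count.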

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Add tests for new functionality
  4. Commit your changes (git commit -m 'Add amazing feature')
  5. Push to the branch (git push origin feature/amazing-feature)
  6. Open a Pull Request

🙏 Acknowledgments

  • Original dataset from Machine Learning course materials
  • Mathematical formulations based on Andrew Ng's Machine Learning course
  • NumPy community for excellent documentation

📞 Contact

If you have questions or suggestions, feel free to:

  • Open an issue on GitHub
  • Contact the author
  • Contribute to the project

Note: This implementation is for educational purposes to understand the mathematical foundations of logistic regression. For production use, consider optimized libraries like scikit-learn.
