This project demonstrates a simple, yet robust, multiple linear regression model built with Python and scikit-learn to predict median house values in California.
- Multiple Features: The model uses multiple features (Median Income, House Age, and Average Rooms) for more accurate predictions.
- Data Preprocessing: It includes a machine learning pipeline to handle data scaling, a crucial step for many models.
- Model Persistence: The trained model is automatically saved to disk (linear_regression_model.joblib), allowing for easy reuse without retraining.
- Comprehensive Evaluation: The script calculates and prints key metrics (Mean Squared Error and R-squared) to evaluate the model's performance.
- Data Visualization: It generates and saves multiple plots (housing_prices_plot.png and housing_prices_residual_plot.png) for visual analysis.
- Prediction Functionality: The script includes a practical example of how to use the trained model to make a prediction on new, unseen data.

- Python: The core programming language for the project.
- scikit-learn: A powerful machine learning library used for building the model, data splitting, and evaluation.
- NumPy: A fundamental library for numerical operations and handling the dataset arrays.
- Matplotlib: Used for creating the data visualizations, including the scatter and residual plots.
- joblib: A library for saving and loading the trained machine learning model.
The core of this project is a LinearRegression model, which is a fundamental algorithm in supervised machine learning. The model is implemented within a scikit-learn pipeline. This pipeline's architecture consists of two main stages:
- Data Preprocessing: The StandardScaler scales the features to have a mean of 0 and a standard deviation of 1. This is crucial for linear models to perform well, as it prevents features with larger values from disproportionately influencing the model.
- Regression Model: The LinearRegression estimator fits a linear model to the preprocessed data, finding the best-fit line (or hyperplane in this case) that minimizes the sum of squared errors between the predicted and actual values (see the sketch below).
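The following minimal sketch shows how such a two-stage pipeline can be assembled; the step names (`scaler`, `regressor`) and the variable name `model_pipeline` are illustrative assumptions, not necessarily those used in the script.

```python
# A minimal sketch of the two-stage pipeline: scaling followed by linear regression.
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

model_pipeline = Pipeline([
    ("scaler", StandardScaler()),       # stage 1: zero mean, unit variance
    ("regressor", LinearRegression()),  # stage 2: ordinary least-squares fit
])
```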
The project performs the following data processing steps:
- Data Splitting: The dataset is divided into a training set (80%) and a testing set (20%) so that the model's performance is evaluated on unseen data.
- Feature Scaling: A StandardScaler transforms the input features to zero mean and unit variance, preventing features with a larger magnitude from dominating the learning process (see the sketch below).
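As a rough illustration, assuming the three features come from scikit-learn's built-in California housing dataset, the split and the pipeline-internal scaling could look like this; the column indices and `random_state` are assumptions, not taken from the script:

```python
# Illustrative 80/20 split; reuses model_pipeline from the sketch above.
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

housing = fetch_california_housing()
X = housing.data[:, [0, 1, 2]]  # MedInc, HouseAge, AveRooms
y = housing.target              # median house value (in units of $100,000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The StandardScaler inside the pipeline is fitted on the training data only,
# so the test set never leaks into the scaling statistics.
model_pipeline.fit(X_train, y_train)
```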
This project performs data analysis through both quantitative metrics and visual inspection:
- Quantitative Metrics: The model's performance is evaluated using two standard metrics (computed as shown in the sketch below):
  - Mean Squared Error (MSE): Measures the average squared difference between the predicted and actual values. A lower MSE indicates a better fit.
  - R-squared (R²): Represents the proportion of the variance in the dependent variable that can be predicted from the independent variables. A score closer to 1.0 indicates a stronger fit.
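Assuming the fitted pipeline and held-out test set from the sketches above, the two metrics can be computed along these lines:

```python
# Evaluate the fitted pipeline on the test set.
from sklearn.metrics import mean_squared_error, r2_score

y_pred = model_pipeline.predict(X_test)
mse = mean_squared_error(y_test, y_pred)  # lower is better
r2 = r2_score(y_test, y_pred)             # closer to 1.0 is better
print(f"Mean Squared Error: {mse:.3f}")
print(f"R-squared: {r2:.3f}")
```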
The model training process is managed to be efficient and reproducible:
- Training: The fit() method is called on the machine learning pipeline, which first scales the training data and then trains the LinearRegression model.
- Persistence: Once trained, the entire pipeline is saved to a .joblib file. This common practice "persists" the model, allowing it to be loaded directly for making predictions without retraining. The script checks for the existence of this file and either loads the existing model or trains a new one (see the sketch below).
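A minimal sketch of this load-or-train logic follows; the filename matches the one mentioned earlier, but the exact control flow and the example input values are assumptions:

```python
# Load the persisted pipeline if it exists; otherwise train and save it.
import os
import joblib

MODEL_PATH = "linear_regression_model.joblib"

if os.path.exists(MODEL_PATH):
    model_pipeline = joblib.load(MODEL_PATH)
else:
    model_pipeline.fit(X_train, y_train)     # scale, then fit LinearRegression
    joblib.dump(model_pipeline, MODEL_PATH)  # persist the entire pipeline

# Example prediction on new, unseen data (made-up values):
# [median income (tens of thousands of $), house age (years), average rooms]
new_sample = [[5.0, 25.0, 6.0]]
predicted_value = model_pipeline.predict(new_sample)
print(f"Predicted median house value: {predicted_value[0]:.2f} (x $100,000)")
```

Persisting the whole pipeline (rather than just the regressor) keeps the fitted scaler and the model together, so new inputs are scaled exactly as the training data was.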
- Python 3.11+
- Required packages (install via pip): scikit-learn, NumPy, Matplotlib, and joblib
- Clone this repository to your local machine and change into the project directory:
  - `git clone https://github.com/sjain2580/simple-linear-regression.git`
  - `cd simple-linear-regression`
- Create and activate a virtual environment (optional but recommended): `python -m venv venv`
  - On Windows: `.\venv\Scripts\activate`
  - On macOS/Linux: `source venv/bin/activate`
- Install the required libraries: `pip install -r requirements.txt`
- To Run the Script: Execute the main Python script from your terminal: `python simple_linear_regression.py`
The script saves two plots for visual analysis:

- Prediction Plot (housing_prices_plot.png): Compares the model's predicted house values against the actual values to show how well the linear relationship is captured.
- Residual Plot (housing_prices_residual_plot.png): Plots the difference between the actual and predicted values. A good residual plot shows a random scatter of points around the zero line, indicating that the model's assumptions are met and it is not systematically under- or over-predicting (see the sketch below).

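For reference, a residual plot along these lines can be produced with Matplotlib; the exact styling and figure layout in the script may differ:

```python
# Minimal residual-plot sketch; reuses y_test and y_pred from the evaluation step.
import matplotlib.pyplot as plt

residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(0, color="red", linestyle="--")  # points should scatter randomly around this line
plt.xlabel("Predicted median house value")
plt.ylabel("Residual (actual - predicted)")
plt.title("Residual Plot")
plt.savefig("housing_prices_residual_plot.png")
```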
Feel free to fork this repository and submit issues or pull requests to improve the project. Suggestions for model enhancement or additional visualizations are welcome!

GitHub: https://github.com/sjain2580
Feel free to reach out if you have any questions or just want to connect!