


Simple Linear Regression Model - California Housing Price Prediction

Overview

This project demonstrates a simple, yet robust, multiple linear regression model built with Python and scikit-learn to predict median house values in California.

Features

  • Multiple Features: The model uses multiple features (Median Income, House Age, and Average Rooms) for more accurate predictions.

  • Data Preprocessing: It includes a machine learning pipeline to handle data scaling, a crucial step for many models.

  • Model Persistence: The trained model is automatically saved to disk (linear_regression_model.joblib), allowing for easy reuse without retraining.

  • Comprehensive Evaluation: The script calculates and prints key metrics (Mean Squared Error and R-squared) to evaluate the model's performance.

  • Data Visualization: It generates and saves multiple plots (housing_prices_plot.png and housing_prices_residual_plot.png) for visual analysis.

  • Prediction Functionality: The script includes a practical example of how to use the trained model to make a prediction on new, unseen data.

Technologies used

  • Python: The core programming language for the project.

  • scikit-learn: A powerful machine learning library used for building the model, data splitting, and evaluation.

  • NumPy: A fundamental library for numerical operations and handling the dataset arrays.

  • Matplotlib: Used for creating the data visualizations, including the scatter and residual plots.

  • joblib: A library for saving and loading the trained machine learning model.

Model used (Architecture)

The core of this project is a LinearRegression model, which is a fundamental algorithm in supervised machine learning. The model is implemented within a scikit-learn pipeline. This pipeline's architecture consists of two main stages:

  1. Data Preprocessing: The StandardScaler scales the features to have a mean of 0 and a standard deviation of 1. This is crucial for linear models to perform well, as it prevents features with larger values from disproportionately influencing the model.

  2. Regression Model: The LinearRegression estimator fits a linear model to the preprocessed data, finding the best-fit line (or hyperplane in this case) that minimizes the sum of squared errors between the predicted and actual values.
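The two-stage pipeline described above can be sketched as follows. This is a minimal illustration, not the project's exact script: it uses randomly generated stand-in data with three features (Median Income, House Age, Average Rooms) in place of the California Housing dataset.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Stand-in data with three features (Median Income, House Age,
# Average Rooms); the real project loads the California Housing dataset.
rng = np.random.default_rng(0)
X = rng.uniform([0.5, 1, 3], [15, 52, 10], size=(200, 3))
y = 0.4 * X[:, 0] + 0.01 * X[:, 1] + 0.1 * X[:, 2] + rng.normal(0, 0.1, 200)

# Stage 1: StandardScaler rescales each feature to zero mean, unit variance.
# Stage 2: LinearRegression fits an ordinary least-squares hyperplane.
model = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])
model.fit(X, y)
```

Because scaling lives inside the pipeline, calling `model.predict()` on raw, unscaled inputs automatically applies the same transformation that was learned during training.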

Data Processing

The project performs the following data processing steps:

  • Data Splitting: The dataset is divided into a training set (80%) and a testing set (20%) to ensure the model's performance is evaluated on unseen data.

  • Feature Scaling: A StandardScaler is applied to the input features. This process transforms the data such that it has zero mean and unit variance. Scaling prevents features with a larger magnitude from dominating the learning process.
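The split-then-scale order matters: the scaler is fit on the training set only and then applied to the test set, so no information from the held-out data leaks into training. A minimal sketch with illustrative variable names:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data for illustration.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# 80/20 split; a fixed random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training data only, then apply the
# learned mean/variance to both sets to avoid data leakage.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```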

Data Analysis

This project performs data analysis through both quantitative metrics and visual inspection:

  • Quantitative Metrics: The model's performance is evaluated using two standard metrics:

  • Mean Squared Error (MSE): Measures the average squared difference between the estimated values and the actual values. A lower MSE indicates a better fit.

  • R-squared (R2): Represents the proportion of the variance in the dependent variable that can be predicted from the independent variables. A score closer to 1.0 indicates a stronger fit.
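Both metrics come straight from scikit-learn. A small worked example with toy values (not the project's actual results):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy actual vs. predicted values, purely for illustration.
y_true = np.array([3.0, 2.5, 4.0, 5.5])
y_pred = np.array([2.8, 2.7, 4.2, 5.0])

mse = mean_squared_error(y_true, y_pred)  # average of squared errors
r2 = r2_score(y_true, y_pred)             # fraction of variance explained
print(f"MSE: {mse:.4f}, R^2: {r2:.4f}")
```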

Model Training

The model training process is managed to be efficient and reproducible:

  • Training: The fit() method is called on the machine learning pipeline, which first scales the training data and then trains the LinearRegression model.

  • Persistence: Once trained, the entire pipeline is saved to a .joblib file. This is a common practice that "persists" the model, allowing it to be loaded directly for making predictions without the need for a full retraining process. The script intelligently checks for the existence of this file and either loads the existing model or trains a new one.
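The load-or-train logic can be sketched like this (the stand-in data is illustrative; the filename matches the one the project describes):

```python
import os
import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

MODEL_PATH = "linear_regression_model.joblib"

# Stand-in training data for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)

if os.path.exists(MODEL_PATH):
    # Reuse the previously trained pipeline without retraining.
    model = joblib.load(MODEL_PATH)
else:
    # Train from scratch and persist the whole pipeline, scaler
    # included, so later loads can predict on raw inputs directly.
    model = Pipeline([
        ("scaler", StandardScaler()),
        ("regressor", LinearRegression()),
    ])
    model.fit(X, y)
    joblib.dump(model, MODEL_PATH)
```

Saving the entire pipeline, rather than just the estimator, keeps the fitted scaler and the model together so they can never drift out of sync.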

Prerequisites

  • Python 3.11+
  • Required packages (install via pip): scikit-learn, NumPy, Matplotlib, joblib

How to Run the Project

  1. Clone this repository to your local machine:
git clone https://github.com/sjain2580/simple-linear-regression.git
cd simple-linear-regression
  2. Create and activate a virtual environment (optional but recommended):
python -m venv venv
  • On Windows:
.\venv\Scripts\activate
  • On macOS/Linux:
source venv/bin/activate
  3. Install the required libraries:
pip install -r requirements.txt
  4. Run the main Python script from your terminal:
python simple_linear_regression.py

Visualization

  • Prediction Plot (housing_prices_plot.png): Compares the model's predicted house values against the actual values to show how well the linear relationship is captured.

  • Residual Plot (housing_prices_residual_plot.png): Plots the difference between the actual and predicted values. A good residual plot shows a random scatter of points around the zero line, indicating that the model's assumptions are met and it is not systematically under- or over-predicting.
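A residual plot of this kind can be produced with Matplotlib roughly as follows. The values here are randomly generated stand-ins; the real script plots the test-set predictions.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

# Stand-in actual and predicted values for illustration.
rng = np.random.default_rng(1)
y_test = rng.uniform(0.5, 5.0, size=100)
y_pred = y_test + rng.normal(0, 0.3, size=100)

residuals = y_test - y_pred

fig, ax = plt.subplots()
ax.scatter(y_pred, residuals, alpha=0.6)
ax.axhline(0, color="red", linestyle="--")  # zero-error reference line
ax.set_xlabel("Predicted value")
ax.set_ylabel("Residual (actual - predicted)")
ax.set_title("Residual Plot")
fig.savefig("housing_prices_residual_plot.png")
```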

Contributors

https://github.com/sjain2580 Feel free to fork this repository and submit issues or pull requests to improve the project. Suggestions for model enhancement or additional visualizations are welcome!

Connect with Me

Feel free to reach out via LinkedIn, GitHub, or email if you have any questions or just want to connect!

