# Introduction to Linear Regression

Linear regression is a basic and commonly used type of predictive analysis. It models a linear relationship between a dependent variable (Y) and one or more independent variables (X).

## Key Concepts
- **Dependent Variable (Y)**: The variable we are trying to predict or explain.
- **Independent Variable (X)**: The variable we use to make predictions.
- **Data Points**: Each individual pair of X and Y values.
- **Regression Line**: The line that best fits all the data points.
- **Slope**: The rate at which Y changes with respect to X.
- **Intercept**: The value of Y when X is 0.
- **Random Error**: The difference between the observed value and the value predicted by the model.

## How Does It Work?

The goal of linear regression is to find the best-fitting straight line through the data points. This line can be used to predict values of Y based on X. The line is defined by the equation:

**Y = a + bX**

Where:

- **Y** is the predicted value of the dependent variable (what we want to find out).
- **X** is the independent variable (the input).
- **a** is the intercept of the line (where the line crosses the Y-axis).
- **b** is the slope of the line (how much Y changes with each unit change in X).
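
As a minimal sketch of how the equation is used, the snippet below evaluates the line for a given input. The function name `predict_y` and the coefficient values are hypothetical, chosen only for illustration; they are not fitted from any data.

```python
def predict_y(x, a, b):
    """Evaluate the line Y = a + bX for a single input x."""
    return a + b * x

# Hypothetical coefficients, for illustration only
a = 10.0  # intercept: value of Y when X is 0
b = 0.5   # slope: change in Y per unit change in X

print(predict_y(0, a, b))    # the intercept itself
print(predict_y(100, a, b))  # intercept plus 100 units of slope
```

Increasing X by one unit always changes the prediction by exactly `b`, which is what makes the model linear.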


### a and b can be computed by the following formulas:

![Formulas for the intercept a and the slope b](formula.png)
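
For readers who cannot view the image, and assuming it shows the ordinary least squares estimates, one standard way to write them for n data points (X_i, Y_i) with sample means X̄ and Ȳ is:

$$
b = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2},
\qquad
a = \bar{Y} - b\,\bar{X}
$$

The slope is the covariance of X and Y divided by the variance of X, and the intercept is chosen so that the line passes through the point (X̄, Ȳ).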

### Regression

![Regression illustration](Regression.png)

## Example: Predicting House Prices
Let's create a simple example to predict house prices based on the total area in square feet.

In [None]:
!pip install scikit-learn

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv("Bengaluru_House_Data.csv")
print(df.head())
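
Before plotting or fitting, it is worth checking that `total_sqft` is actually numeric. This is an assumption about the file rather than something shown above: in some copies of this dataset the column is stored as text and contains entries such as ranges, which `LinearRegression` cannot use. A minimal sketch of one way to handle that, using a hypothetical helper `clean_total_sqft` and made-up sample rows:

```python
import pandas as pd

def clean_total_sqft(df):
    """Return a copy with numeric total_sqft and price; unparseable rows are dropped."""
    out = df.copy()
    # Values that cannot be parsed as a number become NaN
    out['total_sqft'] = pd.to_numeric(out['total_sqft'], errors='coerce')
    out['price'] = pd.to_numeric(out['price'], errors='coerce')
    return out.dropna(subset=['total_sqft', 'price']).reset_index(drop=True)

# Hypothetical rows for illustration only, not taken from the real file
sample = pd.DataFrame({
    'total_sqft': ['1056', '2100 - 2850', '1200', None],
    'price': [39.07, 186.0, 51.0, 60.0],
})
cleaned = clean_total_sqft(sample)
print(cleaned)
```

Applied to the real data this would be `df = clean_total_sqft(df)`. Dropping rows is the simplest choice; converting ranges to their midpoint is another option, at the cost of extra parsing code.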

In [None]:
# Plotting the data
plt.figure(figsize=(10, 6))
plt.scatter(df['total_sqft'], df['price'], color='blue')
plt.title('House Prices vs. Total Square Feet')
plt.xlabel('Total Square Feet')
plt.ylabel('House Price (in Lakhs)')
plt.show()

In the above plot:
- **X-axis**: Total Square Feet (Independent Variable)
- **Y-axis**: House Price (Dependent Variable)

The blue dots represent our data points.

## Building the Regression Model
Now, let's build a linear regression model to predict house prices based on the total square footage.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = df[['total_sqft']]
y = df[['price']]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

In [None]:
# Get the model parameters
slope = model.coef_
intercept = model.intercept_

# Print the slope and intercept
print(f"Slope: {slope}")
print(f"Intercept: {intercept}")

The `LinearRegression` model in `scikit-learn` computes the best-fitting line for our data by minimizing the sum of squared differences between the observed values (actual house prices) and the predicted values (values on the regression line). The `slope` represents how much the predicted house price changes for each additional square foot, and the `intercept` is the predicted house price when the total square footage is zero.
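
To make the least squares idea concrete, the sketch below computes the slope and intercept from the closed-form formulas and compares them with what `LinearRegression` returns. It uses synthetic data generated only for this illustration, not the housing data, and the names `x_demo`, `y_demo`, `b_hat`, `a_hat`, and `model_check` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration only: a noisy straight line
rng = np.random.default_rng(0)
x_demo = rng.uniform(500, 3000, size=50)
y_demo = 20.0 + 0.05 * x_demo + rng.normal(0, 5, size=50)

# Closed-form least squares estimates of slope and intercept
x_mean, y_mean = x_demo.mean(), y_demo.mean()
b_hat = np.sum((x_demo - x_mean) * (y_demo - y_mean)) / np.sum((x_demo - x_mean) ** 2)
a_hat = y_mean - b_hat * x_mean

# The same fit via scikit-learn (expects a 2D feature array)
model_check = LinearRegression().fit(x_demo.reshape(-1, 1), y_demo)

print(f"closed form:  slope={b_hat:.4f}, intercept={a_hat:.4f}")
print(f"scikit-learn: slope={model_check.coef_[0]:.4f}, intercept={model_check.intercept_:.4f}")
```

The two sets of numbers should agree up to floating-point rounding, since both solve the same minimization problem.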

## Making Predictions
Let's use our trained model to make predictions.

In [None]:
# Make predictions on the training set
y_train_pred = model.predict(X_train)

# Make predictions on the test set
y_test_pred = model.predict(X_test)

In [None]:
# Plotting the data points and the regression line
plt.figure(figsize=(10, 6))

# Plot training and test data points, then the fitted line
plt.scatter(X_train, y_train, color='blue', label='Training data points')
plt.scatter(X_test, y_test, color='green', label='Test data points')
plt.plot(X_train, y_train_pred, color='red', linewidth=2, label='Regression line (Train)')

# Add labels, title, and legend
plt.title('House Prices vs. Total Square Feet')
plt.xlabel('Total Square Feet')
plt.ylabel('House Price (in Lakhs)')
plt.legend()
plt.show()

The red line in the plot above is the regression line. It is the best-fit line: the one that minimizes the sum of squared vertical distances between the observed training points (blue dots) and the corresponding predictions on the line. The green dots are the held-out test points.

## Evaluating the Model
We can evaluate how well our model fits the data using metrics like Mean Squared Error (MSE) and R-squared (R²).

In [None]:
from sklearn.metrics import mean_squared_error, r2_score
# Calculate and print the mean squared error and R^2 score
mse_train = mean_squared_error(y_train, y_train_pred)
r2_train = r2_score(y_train, y_train_pred)
mse_test = mean_squared_error(y_test, y_test_pred)
r2_test = r2_score(y_test, y_test_pred)

print(f"Training set Mean Squared Error: {mse_train}")
print(f"Training set R^2 Score: {r2_train}")
print(f"Test set Mean Squared Error: {mse_test}")
print(f"Test set R^2 Score: {r2_test}")

The Mean Squared Error (MSE) measures the average squared difference between the observed values and the predicted values. A lower MSE indicates a better fit.

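The R-squared (R²) score is the proportion of the variance in the dependent variable that is explained by the independent variable or variables in the model. A score of 1 indicates a perfect fit, and 0 means the model does no better than always predicting the mean. On the training set of a least squares fit with an intercept it lies between 0 and 1, but on the test set it can be negative when the model predicts worse than the mean.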
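
Both metrics can be computed by hand, which may help clarify what the library functions report. The sketch below uses small made-up arrays (`y_true`, `y_pred`, and `bad_pred` are hypothetical values, not model output) and checks the manual results against `scikit-learn`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical observed and predicted values, for illustration only
y_true = np.array([40.0, 55.0, 70.0, 95.0])
y_pred = np.array([42.0, 50.0, 75.0, 90.0])

# MSE: mean of the squared residuals
residuals = y_true - y_pred
mse_manual = np.mean(residuals ** 2)

# R^2: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))

# Predictions that are worse than always guessing the mean
bad_pred = np.array([95.0, 70.0, 55.0, 40.0])
print(r2_score(y_true, bad_pred))
```

The last line prints a score below 0, because the residual sum of squares for `bad_pred` exceeds the total sum of squares.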

In [None]:
def predict_price(total_sqft):
    # Ensure the input is a DataFrame with the correct feature name
    total_sqft_df = pd.DataFrame({'total_sqft': [total_sqft]})

    predicted_price = model.predict(total_sqft_df)
    return predicted_price[0][0]

In [None]:
print(predict_price(500))

## Summary
In this notebook, we:
- Explained the key concepts of simple linear regression
- Used a simple example to demonstrate simple linear regression
- Built a simple linear regression model
- Made predictions using the model
- Visualized the regression line
- Evaluated the model's performance using MSE and R-squared

This completes our introduction to simple linear regression.