# Linear Regression

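Linear regression models the relationship between a dependent variable and one or more independent variables by fitting a straight line to the data. This line is known as the regression line and, for a single independent variable, is represented by the linear equation Y = a * X + b. With several independent variables, as in the house-price example below, each variable gets its own slope.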

In this equation:

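- Y – Dependent variable (the value being predicted)
- a – Slope (the change in Y for a one-unit change in X)
- X – Independent variable (the input feature)
- b – Intercept (the value of Y when X is 0)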
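
As a minimal sketch of how the slope and intercept can be estimated for a single independent variable, the ordinary least-squares formulas are applied below to a small made-up dataset. The arrays `X` and `Y` are hypothetical toy values chosen to lie exactly on the line Y = 2 * X + 1; they are not part of the house-price data used later.

```python
import numpy as np

# Hypothetical toy data lying exactly on the line Y = 2*X + 1
X = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([3.0, 5.0, 7.0, 9.0])

# Ordinary least-squares estimates for simple linear regression:
# slope = covariance(X, Y) / variance(X), intercept = mean(Y) - slope * mean(X)
a = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b = Y.mean() - a * X.mean()

print("slope a:", a)
print("intercept b:", b)
```

Because the toy points sit exactly on a line, the estimates recover the slope 2 and intercept 1. With noisy data the same formulas give the line that minimizes the sum of squared vertical distances to the points.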

Definition: Finds the best-fit line for predicting a continuous value<br>
Use Cases: House price prediction, sales forecasting<br>
Strengths: Simple, interpretable, a useful baseline model

## Import Libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.preprocessing import MinMaxScaler

## Create Dataset

In [2]:
np.random.seed(42)
size = np.random.randint(500, 3000, 200)
bedrooms = np.random.randint(1, 6, 200)
age = np.random.randint(0, 50, 200)

price = 150*size + 20000*bedrooms - 500*age + np.random.normal(0, 50000, 200)

In [3]:
df = pd.DataFrame({'size':size, 'bedrooms':bedrooms, 'age': age, 'price':price})

In [4]:
df.head()

| | size | bedrooms | age | price |
|---|---|---|---|---|
| 0 | 1360 | 3 | 47 | 251704.624091 |
| 1 | 1794 | 2 | 20 | 299729.620039 |
| 2 | 1630 | 3 | 38 | 290383.804927 |
| 3 | 1595 | 3 | 35 | 243099.510807 |
| 4 | 2138 | 4 | 32 | 385925.508713 |


In [5]:
df.shape

(200, 4)

In [6]:
df.describe()

| | size | bedrooms | age | price |
|---|---|---|---|---|
| count | 200.0 | 200.0 | 200.0 | 200.0 |
| mean | 1793.225 | 2.94 | 25.195 | 313009.969349 |
| std | 679.916795 | 1.434114 | 14.218219 | 111427.034908 |
| min | 501.0 | 1.0 | 0.0 | 32193.273545 |
| 25% | 1259.0 | 2.0 | 12.0 | 218793.265298 |
| 50% | 1801.5 | 3.0 | 25.5 | 316562.487388 |
| 75% | 2327.5 | 4.0 | 36.0 | 400144.28329 |
| max | 2989.0 | 5.0 | 49.0 | 560622.938293 |


## Feature Engineering

In [7]:
df.corr()['price']

size        0.862757
bedrooms    0.165222
age        -0.106272
price       1.000000
Name: price, dtype: float64

In [8]:
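# Caution: price_per_sqft is computed from the target (price), so keeping it
# as a model input leaks the target and inflates the evaluation scores.
df['price_per_sqft'] = df['price'] / df['size']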

In [9]:
df['room_ratio'] = df['size'] / df['bedrooms']
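
Of the two engineered features, `room_ratio` uses only input columns, while `price_per_sqft` is derived from `price`, the value the model is asked to predict. Such a feature would not be available for a house whose price is unknown. The sketch below is one way to gauge its effect: it regenerates the same synthetic data and fits the model with and without the leaked column. The helper `fit_and_score` and the lists `clean_features` and `leaky_features` are names introduced here for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Regenerate the synthetic dataset with the same recipe as the notebook
np.random.seed(42)
size = np.random.randint(500, 3000, 200)
bedrooms = np.random.randint(1, 6, 200)
age = np.random.randint(0, 50, 200)
price = 150*size + 20000*bedrooms - 500*age + np.random.normal(0, 50000, 200)
df = pd.DataFrame({'size': size, 'bedrooms': bedrooms, 'age': age, 'price': price})

df['price_per_sqft'] = df['price'] / df['size']  # derived from the target
df['room_ratio'] = df['size'] / df['bedrooms']   # derived from inputs only


def fit_and_score(feature_names):
    """Fit on an 80/20 split and return (train R2, test R2)."""
    X = df[feature_names]
    y = df['price']
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    model = LinearRegression().fit(X_tr, y_tr)
    return model.score(X_tr, y_tr), model.score(X_te, y_te)


# Hypothetical feature lists used only for this comparison
clean_features = ['size', 'bedrooms', 'age', 'room_ratio']
leaky_features = clean_features + ['price_per_sqft']

clean_train_r2, clean_test_r2 = fit_and_score(clean_features)
leaky_train_r2, leaky_test_r2 = fit_and_score(leaky_features)

print("Without price_per_sqft: train R2 =", clean_train_r2, "test R2 =", clean_test_r2)
print("With price_per_sqft:    train R2 =", leaky_train_r2, "test R2 =", leaky_test_r2)
```

The scores from the feature set without `price_per_sqft` are the ones that reflect how the model would behave on houses with unknown prices.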

## Train Test Split

In [10]:
X = df.drop(['price'], axis=1)
y = df['price']

In [11]:
X

| | size | bedrooms | age | price_per_sqft | room_ratio |
|---|---|---|---|---|---|
| 0 | 1360 | 3 | 47 | 185.076929 | 453.333333 |
| 1 | 1794 | 2 | 20 | 167.073367 | 897.000000 |
| 2 | 1630 | 3 | 38 | 178.149574 | 543.333333 |
| 3 | 1595 | 3 | 35 | 152.413486 | 531.666667 |
| 4 | 2138 | 4 | 32 | 180.507722 | 534.500000 |
| ... | ... | ... | ... | ... | ... |
| 195 | 1647 | 4 | 12 | 257.935665 | 411.750000 |
| 196 | 686 | 5 | 27 | 339.586349 | 137.200000 |
| 197 | 2294 | 4 | 48 | 167.305503 | 573.500000 |
| 198 | 1159 | 2 | 24 | 225.979893 | 579.500000 |


In [12]:
y

0      251704.624091
1      299729.620039
2      290383.804927
3      243099.510807
4      385925.508713
           ...      
195    424820.039932
196    232956.235172
197    383798.823392
198    261910.696108
199    371846.231349
Name: price, Length: 200, dtype: float64

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
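
With `test_size=0.2`, 20% of the 200 rows are held out for testing, and a fixed `random_state` makes the shuffle repeatable. The sketch below illustrates both points on placeholder arrays; `X_demo` and `y_demo` are hypothetical stand-ins with the same number of rows as the dataset, not the actual features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical placeholder data: 200 rows, 2 feature columns
X_demo = np.arange(400).reshape(200, 2)
y_demo = np.arange(200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print("train rows:", X_tr.shape[0])
print("test rows:", X_te.shape[0])

# Repeating the call with the same random_state yields the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print("same split:", np.array_equal(y_te, y_te2))
```

The split leaves 160 rows for training and 40 for testing, so every test metric reported later is computed on only 40 houses.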

## Model Training

In [14]:
model = LinearRegression()
model.fit(X_train, y_train)
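
After fitting, the learned slopes and intercept are available as `model.coef_` and `model.intercept_`. As a rough sanity check, the sketch below fits a model on only the three raw columns, which is an assumption made here so the estimates can be compared with the coefficients used to generate `price` (150 per unit of size, 20000 per bedroom, -500 per year of age). The variable `raw_model` is a name introduced for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Regenerate the synthetic dataset with the same recipe as the notebook
np.random.seed(42)
size = np.random.randint(500, 3000, 200)
bedrooms = np.random.randint(1, 6, 200)
age = np.random.randint(0, 50, 200)
price = 150*size + 20000*bedrooms - 500*age + np.random.normal(0, 50000, 200)
df = pd.DataFrame({'size': size, 'bedrooms': bedrooms, 'age': age, 'price': price})

# Assumption for this sketch: use only the raw input columns
X_raw = df[['size', 'bedrooms', 'age']]
y = df['price']
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y, test_size=0.2, random_state=42)

raw_model = LinearRegression().fit(X_tr, y_tr)

# Pair each feature name with its learned slope
for name, coef in zip(X_raw.columns, raw_model.coef_):
    print(name, "slope:", coef)
print("intercept:", raw_model.intercept_)
```

Because the noise term has a standard deviation of 50000, the estimated slopes will not match the generating values exactly. The slope for `size` is typically the closest, since that feature varies over the widest range.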

## Evaluation

In [15]:
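# Predict the price for a single house (the feature values of row 0).
# Passing a DataFrame with the training column names avoids the
# "X does not have valid feature names" warning raised for a plain list.
model.predict(pd.DataFrame([[1360, 3, 47, 185.076929, 453.333333]], columns=X.columns))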



array([238499.14501563])

In [24]:
y_pred = model.predict(X_test)

In [25]:
print("\nR²:", r2_score(y_test, y_pred))


R²: 0.927712751376091


In [26]:
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))

RMSE: 30431.372926852153


In [27]:
print("MAE:", mean_absolute_error(y_test, y_pred))

MAE: 23832.138994932924
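
The three metrics printed above can also be computed directly from their definitions, which may help when interpreting them. The sketch below does this with NumPy on a tiny made-up example; `y_true_demo` and `y_pred_demo` are hypothetical values, not the notebook's predictions.

```python
import numpy as np

# Hypothetical true values and predictions
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.0, 5.0, 8.0, 9.0])

errors = y_true_demo - y_pred_demo

# MAE: average absolute error, in the same units as the target
mae = np.mean(np.abs(errors))

# RMSE: square root of the average squared error; penalizes large errors more
rmse = np.sqrt(np.mean(errors ** 2))

# R2: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MAE:", mae)
print("RMSE:", rmse)
print("R2:", r2)
```

MAE and RMSE are expressed in the units of the target, so for the house-price model they read as price errors, while R² is unitless and measures the share of the variance in the target that the predictions account for.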


In [20]:
print("Train R²:", model.score(X_train, y_train))
print("Test R² :", model.score(X_test, y_test))

Train R²: 0.9399506540831034
Test R² : 0.927712751376091
