
### What is Regression?

**Regression** is a supervised learning technique used to predict continuous values based on one or more input features.

E.g., predicting house prices, stock prices, salaries, temperatures, etc.

### What is Linear Regression?

**Linear Regression** models the relationship between one or more independent variables (features) and a dependent variable (target) using a straight line:

$y = \beta_0 + \beta_1x + \epsilon$

where
- $y$: target value (dependent variable)
- $x$: input feature
- $\beta_0$: intercept (the point where the line crosses the y-axis)
- $\beta_1$: slope of the line (change in $y$ per unit change in $x$)
- $\epsilon$: error or noise (the part we want to minimize)

In the case of multiple features, it becomes:

$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + ... + \beta_nx_n + \epsilon$
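
As a small illustration of the multi-feature equation, the prediction (ignoring the noise term) is the intercept plus a dot product of the coefficients and the features. The coefficient and feature values below are made up purely for the example.

```python
import numpy as np

# Hypothetical values, chosen only to illustrate the formula
beta_0 = 1.0                        # intercept
betas = np.array([2.0, -0.5, 3.0])  # beta_1, beta_2, beta_3
x = np.array([1.0, 4.0, 2.0])       # x_1, x_2, x_3 for one sample

# y_hat = beta_0 + beta_1*x_1 + beta_2*x_2 + beta_3*x_3
y_hat = beta_0 + np.dot(betas, x)
print(y_hat)  # 7.0
```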

### Goal of Linear Regression

Find the best line (or hyperplane in higher dimensions) that minimizes the error between predicted and actual values. This is where the **cost function** comes into the picture.

### What is a Cost Function?

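A cost function measures how far the model's predictions are from the actual values. For linear regression, the usual choice is the **Mean Squared Error (MSE)**: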

$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$

where
- $y_i$: actual value
- $\hat{y}_i$: predicted value
- $n$: number of samples
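
The MSE formula can be checked by hand on a few numbers. The sketch below uses made-up actual and predicted values, not the placement data.

```python
import numpy as np

# Made-up actual and predicted values for illustration
y_true = np.array([3.0, 2.0, 4.0, 5.0])
y_pred = np.array([2.5, 2.0, 4.5, 4.0])

# Mean of the squared differences between actual and predicted values
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 0.375
```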

We use gradient descent or a closed-form solution to find the $\beta$ values that minimize the MSE.
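
For intuition, here is a minimal gradient descent sketch for a single feature. The toy data, learning rate, and iteration count are arbitrary assumptions for the example, and this is not how scikit-learn fits `LinearRegression`.

```python
import numpy as np

# Toy data lying exactly on y = 1 + 2x (made up for illustration)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

beta_0, beta_1 = 0.0, 0.0  # initial guesses
lr = 0.05                  # learning rate (hypothetical choice)

for _ in range(5000):
    error = (beta_0 + beta_1 * x) - y
    # Partial derivatives of the MSE with respect to beta_0 and beta_1
    grad_0 = 2 * error.mean()
    grad_1 = 2 * (error * x).mean()
    # Step against the gradient
    beta_0 -= lr * grad_0
    beta_1 -= lr * grad_1

print(round(beta_0, 3), round(beta_1, 3))  # approximately 1.0 2.0
```

Each step moves both parameters a little in the direction that lowers the MSE, so the estimates approach the intercept and slope used to generate the data.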

In scikit-learn, this is done internally using the **least squares method.**

### What is the Least Squares Method?

It's a mathematical way to find the best-fit line for regression by finding the values $\beta_0,\beta_1,...$ that minimize the sum of squared errors (and therefore the MSE).
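
One way to try the least squares idea directly is NumPy's `np.linalg.lstsq`, which solves for the coefficients that minimize the sum of squared residuals. The data below is made up, and the sketch is only an illustration of the method, not a description of scikit-learn's internals.

```python
import numpy as np

# Toy data (made up): roughly linear with some scatter
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Design matrix with a column of ones so the first coefficient is beta_0
A = np.column_stack([np.ones_like(x), x])

# Solve for the coefficients that minimize the sum of squared residuals
(beta_0, beta_1), *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta_0, beta_1)  # approximately 0.05 1.99
```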

### How do we measure Model Performance?

|Metric|Meaning|
|------|-------|
|R$^2$ score|Proportion of variance in the target explained by the model (closer to 1 is better)|
|RMSE|Square root of the MSE; typical error magnitude in the target's original units|
|MAE|Mean Absolute Error; less sensitive to outliers than RMSE|
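
The three metrics can be computed from their definitions with plain NumPy. The values below are made up for illustration; in practice `sklearn.metrics` provides `r2_score`, `mean_squared_error`, and `mean_absolute_error`.

```python
import numpy as np

# Made-up actual and predicted values for illustration
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 5.0])

residuals = y_true - y_pred

mae = np.mean(np.abs(residuals))         # Mean Absolute Error
rmse = np.sqrt(np.mean(residuals ** 2))  # Root Mean Squared Error

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, rmse, r2)  # approximately 0.5 0.612 0.7
```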


In [1]:
import numpy as np
import pandas as pd

In [58]:
df = pd.read_csv('https://raw.githubusercontent.com/campusx-official/100-days-of-machine-learning/refs/heads/main/day48-simple-linear-regression/placement.csv')
df.head()

| |cgpa|package|
|-|----|-------|
|0|6.89|3.26|
|1|5.12|1.98|
|2|7.82|3.25|
|3|7.42|3.67|
|4|6.94|3.57|


In [59]:
df.shape

(200, 2)

In [60]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   cgpa     200 non-null    float64
 1   package  200 non-null    float64
dtypes: float64(2)
memory usage: 3.3 KB


### ML Workflow

1. Prepare Data
2. Train-Test Split
3. Train Model
4. Predict
5. Evaluate

In [61]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

In [62]:
# 1. Prepare data
X = df.iloc[:, 0:1]
y = df.iloc[:, -1]

In [63]:
X.shape

(200, 1)

In [64]:
y.shape

(200,)

In [65]:
# 2. Train-Test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
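
To make the effect of `test_size=0.2` concrete, here is a rough NumPy-only sketch of a shuffled 80/20 split over 200 row positions. It is not scikit-learn's implementation and will not reproduce the same rows as `random_state=42` above; the seed and variable names are arbitrary.

```python
import numpy as np

n_samples = 200   # same number of rows as the placement dataset
test_size = 0.2

rng = np.random.default_rng(42)        # fixed seed for reproducibility
indices = rng.permutation(n_samples)   # shuffled row positions

n_test = int(n_samples * test_size)    # 40 rows held out for testing
test_idx = indices[:n_test]
train_idx = indices[n_test:]

print(len(train_idx), len(test_idx))  # 160 40
```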

We will eventually build our own Linear Regression model that can be trained and then make predictions. Before building the custom model, we'll first use scikit-learn's built-in `LinearRegression` class as a reference.

In [66]:
# 3. Train Model (using scikit-learn)
model1 = LinearRegression()
model1.fit(X_train, y_train)

In [68]:
# 4. Predict
y_pred1 = model1.predict(X_test)

In [69]:
# 5. Evaluate
r2_scikit = r2_score(y_test, y_pred1)
mse_scikit = mean_squared_error(y_test, y_pred1)

In [70]:
print(r2_scikit)

0.7730984312051673


In [107]:
print(mse_scikit)

0.08417638361329656


In [72]:
model1.coef_

array([0.57425647])

In [73]:
model1.intercept_

np.float64(-1.0270069374542108)
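
With the slope and intercept in hand, a prediction is just the line equation applied to an input. The sketch below plugs in the values reported above; the CGPA of 7.0 is a hypothetical input.

```python
# Coefficients reported by the fitted scikit-learn model above
slope = 0.57425647
intercept = -1.0270069374542108

cgpa = 7.0  # hypothetical input
package = intercept + slope * cgpa
print(round(package, 2))  # 2.99
```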

Now it's time to build our own regression model.
As we already know, we are trying to find the straight line that best fits the points in the dataset. That line is defined as:

$\hat{y} = \beta_0 + \beta_1x$

where $\hat{y}$ is the predicted target, $x$ the input feature, $\beta_0$ the intercept, and $\beta_1$ the slope.

Inside the `fit` method, we take the training inputs $x$ and targets $y$ and compute two quantities:

$\beta_1 = \frac{\sum(x_i - \bar x)(y_i - \bar y)}{\sum(x_i - \bar x)^2}$ and $\beta_0 = \bar y - \beta_1\bar x$

where $\bar x$: Mean of input values and $\bar y$: Mean of output values
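
Before wrapping these formulas in a class, it can help to evaluate them once on a tiny example. The sketch below uses made-up data and vectorized NumPy operations in place of an explicit loop.

```python
import numpy as np

# Toy data (made up for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

x_bar, y_bar = x.mean(), y.mean()

# Slope: sum of co-deviations divided by the sum of squared x deviations
beta_1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# Intercept: makes the line pass through the point (x_bar, y_bar)
beta_0 = y_bar - beta_1 * x_bar

print(beta_0, beta_1)  # approximately 2.2 0.6
```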

In [111]:
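class MyLinearRegression:
  def __init__(self):
    self.coef_ = None       # slope (beta_1)
    self.intercept_ = None  # intercept (beta_0)

  def fit(self, X_train, y_train):
    # Compute the means once instead of on every loop iteration
    x_mean = X_train.mean()  # Series with one entry (the single feature column)
    y_mean = y_train.mean()  # scalar

    num = 0  # running sum of (x_i - x_mean) * (y_i - y_mean)
    den = 0  # running sum of (x_i - x_mean)^2
    for i in range(X_train.shape[0]):
      num += (X_train.values[i] - x_mean) * (y_train.values[i] - y_mean)
      den += (X_train.values[i] - x_mean)**2

    self.coef_ = (num/den).values
    self.intercept_ = (y_mean - self.coef_ * x_mean).values

  def predict(self, X_test):
    # Apply y_hat = beta_0 + beta_1 * x to every row
    return self.coef_ * X_test + self.intercept_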

In [112]:
model2 = MyLinearRegression()
model2.fit(X_train, y_train)

In [113]:
y_pred2 = model2.predict(X_test)

In [114]:
r2_custom = r2_score(y_test, y_pred2)
mse_custom = mean_squared_error(y_test, y_pred2)

In [115]:
r2_custom

0.7730984312051673

In [116]:
mse_custom

0.08417638361329656

In [117]:
model2.coef_

array([0.57425647])

In [118]:
model2.intercept_

array([-1.02700694])