Regression is a statistical method for modeling the relationship between one or more independent variables (also known as predictors, features, or inputs) and a dependent variable (also known as the outcome, target, or response). The main goal of regression analysis is to predict the value of the dependent variable from the values of the independent variables.

Regression models are used to forecast future values, to quantify how strongly each variable is related to the outcome, and to make predictions for new observations. Regression analysis is widely used in fields such as economics, finance, psychology, engineering, and the social sciences.

There are several types of regression models, including:

- Simple linear regression:
    This is the simplest form of regression, where one independent variable is used to predict a dependent variable. It assumes that there is a linear relationship between the two variables.

- Multiple linear regression:
    This type of regression uses two or more independent variables to predict a dependent variable. It is useful when several factors can affect the outcome.

- Polynomial regression:
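    This type of regression fits a polynomial in the independent variables to the data. It is useful when the relationship between the variables is not linear. The model is still linear in its coefficients, so it can be fitted with ordinary least squares.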

- Logistic regression:
    This type of regression is used when the dependent variable is categorical, most commonly binary. It predicts the probability that an event occurs.

- Ridge regression:
    This type of regression is used when there is multicollinearity in the data. It adds a penalty on the squared size of the coefficients to the least-squares objective, which shrinks the coefficients and reduces the impact of correlated variables.
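
For the simplest case in the list above, the least-squares line has a closed form: the slope is the covariance of the two variables divided by the variance of the predictor. The following is a minimal sketch of that formula; the function name `fit_simple_linear` and the toy data are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np

def fit_simple_linear(x, y):
    """Return (slope, intercept) of the least-squares line through (x, y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope = sum of co-deviations / sum of squared deviations of x
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # The fitted line always passes through the point of means
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Hypothetical toy data lying exactly on y = 2x + 1
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
y_demo = 2.0 * x_demo + 1.0
slope, intercept = fit_simple_linear(x_demo, y_demo)
print(slope, intercept)
```

Because the toy data is noise-free, the recovered slope and intercept match the line that generated it.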

Regression analysis involves several steps, including data collection, data cleaning, variable selection, model building, model validation, and interpretation of results. The quality of a regression model can be measured with metrics such as R-squared, Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE).
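
The four metrics named above can all be computed from the residuals. The sketch below uses plain NumPy; the helper name `regression_metrics` and the sample values are assumptions for illustration only.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R-squared for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mse = np.mean(errors ** 2)
    # R-squared compares the residual sum of squares with the total
    # sum of squares around the mean of the true values
    ss_res = np.sum(errors ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "mse": mse,
        "rmse": np.sqrt(mse),
        "mae": np.mean(np.abs(errors)),
        "r2": 1 - ss_res / ss_tot,
    }

# Hypothetical values: the errors are 1, 0 and -2
metrics = regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(metrics)
```

MSE and RMSE penalize large errors more heavily than MAE, which is why the single error of size 2 dominates them in this example.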

In summary, regression is a statistical method used to establish a relationship between independent variables and a dependent variable to make predictions and understand the relationships between variables.

# **Dataset**

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [5]:
from sklearn.datasets import load_diabetes

In [14]:
diabetes = load_diabetes(scaled=False)
dataset = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

Variable descriptions for the diabetes dataset from scikit-learn:

- $age$     = age in years
- $sex$     = sex
- $bmi$     = body mass index
- $bp$      = average blood pressure
- $s1$      = tc, total serum cholesterol
- $s2$      = ldl, low-density lipoproteins
- $s3$      = hdl, high-density lipoproteins
- $s4$      = tch, total cholesterol / HDL
- $s5$      = ltg, possibly log of serum triglycerides level
- $s6$      = glu, blood sugar level


## *Train and Test Split*

In [62]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(dataset, diabetes.target, test_size=0.1, random_state=42)

In [63]:
print("Training set shape:", X_train.shape, y_train.shape)
print("Testing set shape:", X_test.shape, y_test.shape)

Training set shape: (397, 10) (397,)
Testing set shape: (45, 10) (45,)


# **Linear Regression**

## *Scratch*

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


In [67]:
class LinearRegression:
    def __init__(self, fit_intercept=True, method='normal'):
        self.fit_intercept = fit_intercept
        self.coef_ = None
        self.intercept_ = None
        self.method = method

    def fit(self, X, y, learning_rate=0.01, n_iter=100):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        # Prepend a column of ones so that the first weight acts as the intercept
        if self.fit_intercept:
            X = np.c_[np.ones(X.shape[0]), X]

        if self.method == 'normal':
            # Normal equation; pinv stays stable when X.T @ X is ill-conditioned
            weights = np.linalg.pinv(X.T @ X) @ X.T @ y
        elif self.method == 'gradient':
            # Plain gradient descent on the MSE; it only converges for a small
            # enough learning_rate, so standardize the features beforehand
            n_samples, n_features = X.shape
            weights = np.random.randn(n_features)
            for _ in range(n_iter):
                y_pred = X @ weights
                gradient = -2 / n_samples * X.T @ (y - y_pred)
                weights = weights - learning_rate * gradient
        else:
            raise ValueError(f"Unknown method: {self.method!r}")

        # Split the weight vector into the intercept and the feature coefficients
        if self.fit_intercept:
            self.intercept_ = weights[0]
            self.coef_ = weights[1:]
        else:
            self.coef_ = weights

    def predict(self, X):
        # fit() stores the intercept separately from coef_, so X must not get a
        # column of ones here; that would give 11 columns against 10 coefficients
        y_pred = np.asarray(X, dtype=float) @ self.coef_
        if self.fit_intercept:
            y_pred = y_pred + self.intercept_
        return y_pred

In [65]:
def mean_squared_error(y_true, y_pred):
    return np.mean((y_true - y_pred)**2)

def r2_score(y_true, y_pred):
    mean_y = np.mean(y_true)
    ss_tot = np.sum((y_true - mean_y)**2)
    ss_res = np.sum((y_true - y_pred)**2)
    return 1 - ss_res / ss_tot

In [68]:
reg = LinearRegression(fit_intercept=True, method='normal')
reg.fit(X_train, y_train)

# Use the fitted model to make predictions on the test data
y_pred = reg.predict(X_test)

# Calculate the mean squared error and R^2 score
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R^2 Score:", r2)

## *Package*

In [None]:
from sklearn.linear_model import LinearRegression

regr = LinearRegression()

regr.fit(X_train, y_train)
# score() returns the R^2 of the predictions on the test set
print(regr.score(X_test, y_test))


# **Multiple Linear Regression**

# **Polynomial Regression**
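
As described in the overview, polynomial regression fits a polynomial to the data while staying linear in its coefficients. The sketch below illustrates this for a single predictor by building a matrix of powers of `x` and solving the least-squares problem with NumPy. The function names `fit_polynomial` and `predict_polynomial` and the toy data are assumptions for illustration, not code from this notebook.

```python
import numpy as np

def fit_polynomial(x, y, degree):
    """Least-squares polynomial coefficients, lowest power first."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Columns of the design matrix are x**0, x**1, ..., x**degree
    design = np.vander(x, degree + 1, increasing=True)
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs

def predict_polynomial(coefs, x):
    """Evaluate the fitted polynomial at the points x."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, len(coefs), increasing=True) @ coefs

# Hypothetical noise-free data from y = 1 - 3x + 0.5x^2
x_demo = np.linspace(-2.0, 2.0, 9)
y_demo = 1.0 - 3.0 * x_demo + 0.5 * x_demo ** 2
coefs = fit_polynomial(x_demo, y_demo, degree=2)
print(coefs)
```

Choosing the degree is a trade-off: a higher degree follows the training data more closely but is more prone to overfitting.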

# **Ridge Regression**
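
As described in the overview, ridge regression adds a penalty term to reduce the impact of correlated variables. One common formulation, assumed here, penalizes the squared size of the coefficients with a strength `alpha` and leaves the intercept unpenalized by centering the data first. The function name `fit_ridge` and the synthetic data are assumptions for illustration, not code from this notebook.

```python
import numpy as np

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge fit; returns (coef, intercept)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Center the data so that the intercept is not penalized
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    n_features = X.shape[1]
    # Solve (Xc^T Xc + alpha * I) coef = Xc^T yc
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept

# Hypothetical data with two nearly collinear predictors
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + 0.01 * rng.normal(size=50)
X_demo = np.column_stack([x1, x2])
y_demo = 3.0 * x1 + rng.normal(scale=0.1, size=50)

coef_ols, _ = fit_ridge(X_demo, y_demo, alpha=0.0)    # no penalty
coef_ridge, _ = fit_ridge(X_demo, y_demo, alpha=1.0)  # with penalty
print("alpha=0:", coef_ols)
print("alpha=1:", coef_ridge)
```

With `alpha=0` the fit reduces to ordinary least squares. Increasing `alpha` never increases the norm of the coefficient vector, which is what stabilizes the estimates when predictors are correlated.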