# Linear Regression 1
We establish a theoretical framework for parametric supervised learning methods, aiming to rigorously define the setup of such problems and discuss predictive and explanatory modeling.


## Framework Overview
The data contains inputs ($X$) and outputs ($Y$), and in a supervised learning problem, a statistical model represents the relationship between $X$ and $Y$:
$$Y=f(X)+\epsilon$$
where

- $f$ : the function we want to approximate.
- $\epsilon$ : random noise, which may or may not be independent of $X$.
- $X$ : Input variables, features, independent variables, explanatory variables, regressors, covariates, predictors.
- $Y$ : Output variable, target, outcome, dependent variable, regressand, response.


We restrict attention to functions of a specific form, called a model, parameterized here by $\beta_0$ and $\beta_1$. Different parameter values give different instances of the model. Each candidate pair $(\beta_0, \beta_1)$ is evaluated with a loss function, which scores how well the model reproduces the observed input/output pairs.
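
To make the idea of scoring parameters concrete, here is a minimal sketch. It assumes a straight-line model $\beta_0 + \beta_1 x$ and mean squared error as the loss; both are illustrative choices, and the names `linear_model` and `mse_loss` and the toy data are hypothetical.

```python
import numpy as np

def linear_model(x, beta_0, beta_1):
    """One instance of the model for a given parameter pair (assumed straight line)."""
    return beta_0 + beta_1 * x

def mse_loss(beta_0, beta_1, x, y):
    """Mean squared error between the observed outputs and the model's outputs."""
    residuals = y - linear_model(x, beta_0, beta_1)
    return float(np.mean(residuals ** 2))

# Toy data generated exactly from y = 1 + 2x (no noise), for illustration only
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

loss_good = mse_loss(1.0, 2.0, x, y)  # parameters that generated the data
loss_bad = mse_loss(0.0, 1.0, x, y)   # a poorer candidate

print("loss at (1, 2):", loss_good)
print("loss at (0, 1):", loss_bad)
```

The pair with the lower loss reproduces the observed data better; fitting the model amounts to searching for the pair with the lowest loss.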


## Key Steps in Supervised Learning
1.	Data: Input and output pairs $(X,Y)$.
2.	Model: Choose a model $f(X,\beta)$ parameterized by $\beta$.
3.	Loss Function: Select a loss function $L(\beta)$ to evaluate the model's performance.
4.	Minimization: Minimize the loss function to obtain the fitted model with parameters $\hat{\beta} = \arg\min_{\beta} L(\beta)$.


Illustration: We simulate the relationship between patient weight and blood volume using a linear regression model, generating systematic and observed data. We then train a linear regression algorithm and visualize the true relationship, observed data, and the fitted model.
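
A simulation along these lines could be sketched as follows. The "true" parameter values, the noise level, the weight range and the sample size are all hypothetical choices made for illustration, and `np.polyfit` stands in for whichever fitting routine is used in practice; plotting is omitted to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical "true" parameters (assumed for illustration, not taken from the notes)
TRUE_BETA_0 = 0.5    # litres
TRUE_BETA_1 = 0.07   # litres per kilogram

n_patients = 200
weight = rng.uniform(50.0, 100.0, size=n_patients)        # inputs X (kg)
systematic = TRUE_BETA_0 + TRUE_BETA_1 * weight           # f(X), the noise-free part
noise = rng.normal(loc=0.0, scale=0.2, size=n_patients)   # epsilon
observed = systematic + noise                             # Y = f(X) + epsilon

# Fit a straight line by least squares; for deg=1 polyfit returns [slope, intercept]
fitted_beta_1, fitted_beta_0 = np.polyfit(weight, observed, deg=1)
fitted = fitted_beta_0 + fitted_beta_1 * weight           # fitted model at the observed weights

print("true parameters:  ", TRUE_BETA_0, TRUE_BETA_1)
print("fitted parameters:", fitted_beta_0, fitted_beta_1)
```

The arrays `systematic`, `observed` and `fitted` correspond to the true relationship, the observed data and the fitted model, and could be passed to `matplotlib.pyplot` for the visualization.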


Two Main Modeling Goals:
1.	Making Predictions: The goal is to produce a model for accurate predictions of new observations.
2.	Making Inferences: The goal is to produce a model explaining the relationship between $X$ and $Y$, aiming for explanatory models that explain variance while being parsimonious.

While these notes primarily focus on predictive modeling, both predictive and explanatory modeling play essential roles in data science and research.


## Multiple Linear Regression

### Objective
- Introduce the multiple linear regression model and demonstrate how to fit the model using the normal equation.
  
### Model Description

- For predicting a quantitative variable $Y$, the multiple linear regression model regresses $Y$ on a set of $p$ features $(X_1, X_2, ..., X_p)$.
- The model is expressed as: 
$$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p + \varepsilon,$$
where $\beta_0$ is the intercept, $\beta_1, ..., \beta_p$ are the coefficients of the features, $\mathbf{X} = (X_1, X_2, ..., X_p)$ is the feature vector, and $\varepsilon$ is the error term.
  
### Fitting the Model

- Given $n$ observations, package them into an $n \times (p+1)$ matrix $\mathbf{X}$ whose first column is all ones (for the intercept $\beta_0$) and whose remaining $p$ columns hold the features, and an $n \times 1$ vector $\mathbf{Y}$.
- The goal is to minimize the mean squared error (MSE) to find the parameter vector $\beta$ that best fits the data.
- Mathematically, the minimizer satisfies the normal equation $\mathbf{X}^T \mathbf{X} \beta = \mathbf{X}^T \mathbf{Y}$. When $\mathbf{X}^T \mathbf{X}$ is invertible, its solution is $\hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}$.
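
The step from minimizing the MSE to the normal equation can be spelled out as follows, assuming $\mathbf{X}$ includes the column of ones and $\mathbf{X}^T \mathbf{X}$ is invertible:

$$L(\beta) = \frac{1}{n}\lVert \mathbf{Y} - \mathbf{X}\beta \rVert^2 = \frac{1}{n}(\mathbf{Y} - \mathbf{X}\beta)^T(\mathbf{Y} - \mathbf{X}\beta)$$

$$\nabla_\beta L(\beta) = -\frac{2}{n}\mathbf{X}^T(\mathbf{Y} - \mathbf{X}\beta) = 0 \;\Longrightarrow\; \mathbf{X}^T\mathbf{X}\beta = \mathbf{X}^T\mathbf{Y} \;\Longrightarrow\; \hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$$

Because $L$ is convex in $\beta$, this stationary point is a global minimizer.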




### Linear Algebra Interpretation
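- Geometrically, this process can be viewed as projecting $\mathbf{Y}$ orthogonally onto the subspace spanned by the columns of $\mathbf{X}$: the fitted values $\mathbf{X}\hat{\beta}$ are the point in that subspace closest to $\mathbf{Y}$, so the residual $\mathbf{Y} - \mathbf{X}\hat{\beta}$ is orthogonal to every column of $\mathbf{X}$.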
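
The projection view can be checked numerically. The sketch below uses randomly generated data (an assumption made purely for illustration) and verifies two properties of an orthogonal projection: the residual is orthogonal to the columns of the data matrix, and the projection matrix `H` (a name introduced here, often called the hat matrix) is idempotent.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n, p = 30, 3
features = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), features])   # n x (p+1) matrix with a column of ones
Y = rng.normal(size=n)

# Least squares coefficients from the normal equation X^T X beta = X^T Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ beta_hat          # fitted values, a vector in the column space of X
residual = Y - Y_hat

# Projection matrix onto the column space of X: H = X (X^T X)^{-1} X^T
H = X @ np.linalg.solve(X.T @ X, X.T)

print("residual orthogonal to columns of X:", np.allclose(X.T @ residual, 0.0, atol=1e-8))
print("H is idempotent (H @ H == H):", np.allclose(H @ H, H))
print("H @ Y reproduces the fitted values:", np.allclose(H @ Y, Y_hat))
```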
  
### Implementation Steps
1. Prepare the data matrix $\mathbf{X}$ (including a column of ones for the intercept) and the target vector $\mathbf{Y}$.
2. Apply the normal equation to find the coefficient vector $\hat{\beta}$.
3. Interpret the result: this coefficient vector is the ordinary least squares (OLS) estimate, which minimizes the MSE.
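
These steps can be sketched in a few lines of NumPy. The function name `fit_ols` and the toy data are hypothetical, and the sketch solves the linear system with `np.linalg.solve` instead of forming the inverse explicitly, which gives the same estimate when $\mathbf{X}^T \mathbf{X}$ is invertible and is usually more stable numerically.

```python
import numpy as np

def fit_ols(features, y):
    """Ordinary least squares via the normal equation (illustrative helper).

    features: array of shape (n, p) without the column of ones
    y:        array of shape (n,)
    Returns the (p+1,) vector [beta_0, beta_1, ..., beta_p].
    """
    features = np.asarray(features, dtype=float)
    y = np.asarray(y, dtype=float)
    n = features.shape[0]
    # Step 1: prepend a column of ones so that beta_0 acts as the intercept
    X = np.column_stack([np.ones(n), features])
    # Step 2: solve X^T X beta = X^T y
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data generated without noise from y = 2 + 3*x1 - 1*x2
features = np.array([[1.0, 2.0],
                     [2.0, 1.0],
                     [3.0, 5.0],
                     [4.0, 3.0],
                     [5.0, 8.0]])
y = 2.0 + 3.0 * features[:, 0] - 1.0 * features[:, 1]

beta_hat = fit_ols(features, y)
print("beta_hat:", beta_hat)
```

Since the toy data contain no noise, the estimate should recover the generating coefficients up to floating-point error.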



### Libraries Used for the next example
- `pandas` for data manipulation.
- `numpy` for numerical operations.
- `matplotlib.pyplot` for plotting.
- `sklearn` for the train/test split, the linear regression model, and the evaluation metrics.

Note:
- A "padded" (or augmented) data matrix is one with a column of ones added, so that the constant term $\beta_0$ (the intercept) is estimated along with the other coefficients.

## Example : Start-up company data

In [50]:
import pandas as pd
# Load the dataset
startup_data = pd.read_csv('50_Startups.csv')
print(startup_data.shape)
startup_data.head(5)

(50, 5)


   R&D Spend  Administration  Marketing Spend       State     Profit
0  165349.20       136897.80        471784.10    New York  192261.83
1  162597.70       151377.59        443898.53  California  191792.06
2  153441.51       101145.55        407934.54     Florida  191050.39
3  144372.41       118671.85        383199.62    New York  182901.99
4  142107.34        91391.77        366168.42     Florida  166187.94


### Data description
- The data has dimension (50 $\times$ 5): 50 companies (rows) and 5 columns.
- R&D Spend, Marketing Spend : The amounts the company spent on research and development and on marketing.
- Administration : The amount the company spent on administration.
- State : The location of the company, one of 3 states: New York, California, or Florida.
- Profit : The profit of the company; this is the target variable.

In [70]:
import numpy as np
import matplotlib.pyplot as plt
from numpy.linalg import inv, det


X = startup_data.iloc[:, :-1]

y = startup_data.iloc[:, 4]


# One-hot encode the categorical State column; drop_first=True drops the
# first level (California), which becomes the baseline category

states=pd.get_dummies(X['State'],drop_first=True)
print('The new column states is :\n', states.head(5))
# Drop the original State column, now replaced by the dummy columns

X= X.drop('State',axis=1)

# Concatenate the numerical features with the dummy columns

X=pd.concat([X,states],axis=1)

# Convert the True/False dummy values into integers 0 and 1
X['Florida']= X['Florida'].astype(int)
X['New York']= X['New York'].astype(int)

print('\n Dimension of X is :\n', X.shape)
print('\n X is :\n', X.head(5))
print('\n y is :\n', y.head(5))

The new column states is :
    Florida  New York
0    False      True
1    False     False
2     True     False
3    False      True
4     True     False

 Dimension of X is :
 (50, 5)

 X is :
    R&D Spend  Administration  Marketing Spend  Florida  New York
0  165349.20       136897.80        471784.10        0         1
1  162597.70       151377.59        443898.53        0         0
2  153441.51       101145.55        407934.54        1         0
3  144372.41       118671.85        383199.62        0         1
4  142107.34        91391.77        366168.42        1         0

 y is :
 0    192261.83
1    191792.06
2    191050.39
3    182901.99
4    166187.94
Name: Profit, dtype: float64
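
One reason for `drop_first=True` is that keeping all three state indicators alongside an intercept column makes the columns linearly dependent, so $\mathbf{X}^T \mathbf{X}$ would not be invertible. The sketch below illustrates this with a hand-built encoding in NumPy; the five state labels mirror the first rows above, and the variable names are hypothetical.

```python
import numpy as np

states = np.array(["New York", "California", "Florida", "New York", "Florida"])
levels = np.array(["California", "Florida", "New York"])   # sorted category levels

# One indicator column per level: entry is 1.0 where the row's state matches the level
one_hot = (states[:, None] == levels[None, :]).astype(float)
ones = np.ones(len(states))

# Intercept plus ALL three indicators: the indicators sum to the intercept column
full = np.column_stack([ones, one_hot])
# Intercept plus indicators with the first level (California) dropped
reduced = np.column_stack([ones, one_hot[:, 1:]])

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)

print("full encoding:   ", full.shape[1], "columns, rank", rank_full)
print("reduced encoding:", reduced.shape[1], "columns, rank", rank_reduced)
```

When the rank is smaller than the number of columns, the normal equation has no unique solution; dropping one level restores full column rank, and the dropped level becomes the baseline against which the other states are measured.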


In [64]:
# Importing train_test_split from sklearn
from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                        shuffle = True, test_size=0.5, random_state=40)

# Import the LinearRegression class
from sklearn.linear_model import LinearRegression

# Create an instance of the LinearRegression class
linear_regression = LinearRegression()

# Fit the training data
linear_regression.fit(X_train, y_train)


# Calculate beta hat with the normal equation
## Add a column of 1s to X for beta_0, giving the augmented matrix.
## A new name is used so that the original X_train DataFrame stays intact.
X_train_aug = np.insert(X_train.to_numpy(dtype=float), 0, 1.0, axis=1)
# numpy.insert(arr, obj, values, axis=None) inserts values along the given axis before the given indices.


beta_hat = np.linalg.inv(X_train_aug.T @ X_train_aug) @ X_train_aug.T @ y_train.to_numpy()


# Observing beta_hat: the intercept, then the R&D Spend and Administration coefficients
print("beta_0_hat =", beta_hat[0])
print("beta_1_hat =", beta_hat[1])
print("beta_2_hat =", beta_hat[2])

# Make predictions
y_pred = linear_regression.predict(X_test)
y_pred



beta_0_hat = 46429.14278898854
beta_1_hat = 0.8125938745884593
beta_2_hat = 0.01571854278720394


array([ 95090.50839871,  98047.61536855,  49145.88809871,  61324.98848333,
        43050.63588856, 185931.96938749, 126286.35357384, 168233.67940151,
       134531.11668461, 178408.788892  , 116870.97339977,  82971.07936474,
       103850.14151027, 112190.58639591,  67434.64568031,  71398.76636008,
       106667.06010749,  73403.15985783, 156032.79708505, 128221.87876296,
        82466.98985232, 124939.36606774,  55542.07337345, 107147.44290074,
       166878.24922446])
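
Predictions can also be made directly from a normal-equation estimate, provided the test matrix is padded with a column of ones in the same way as the training matrix. The sketch below shows this on synthetic data (an assumption, since it does not load `50_Startups.csv`); the helper `pad`, the generating coefficients and the manual split are hypothetical, and `np.linalg.lstsq` is used only as an independent check.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
n, p = 40, 2
features = rng.normal(size=(n, p))
# Synthetic target: intercept 1, coefficients 2 and -3, plus small noise
y = 1.0 + features @ np.array([2.0, -3.0]) + rng.normal(scale=0.1, size=n)

# Shuffle the row indices and split them in half (train / test)
order = rng.permutation(n)
train_idx, test_idx = order[: n // 2], order[n // 2 :]

def pad(feature_matrix):
    """Prepend a column of ones so the first coefficient acts as the intercept."""
    return np.column_stack([np.ones(len(feature_matrix)), feature_matrix])

X_train_aug = pad(features[train_idx])
X_test_aug = pad(features[test_idx])

# Fit on the training half with the normal equation
beta_hat = np.linalg.solve(X_train_aug.T @ X_train_aug, X_train_aug.T @ y[train_idx])

# Predict on the held-out half using the same padding
y_pred = X_test_aug @ beta_hat

# Independent check with NumPy's least squares solver
beta_lstsq, *_ = np.linalg.lstsq(X_train_aug, y[train_idx], rcond=None)
print("estimates agree:", np.allclose(beta_hat, beta_lstsq))
```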

In [65]:
# Import the r2_score and mean_squared_error functions
from sklearn.metrics import r2_score, mean_squared_error

# Calculate the R2 score
r2_score_value = r2_score(y_test, y_pred)

# Calculate the mean squared error
mean_squared_error_value = mean_squared_error(y_test, y_pred)

# Calculate the root mean squared error
root_mean_squared_error_value = np.sqrt(mean_squared_error(y_test, y_pred))

print('r2 score:', r2_score_value)
print('mean squared error:', mean_squared_error_value)
print('root mean squared error:', root_mean_squared_error_value)


r2 score: 0.9209128758791416
mean squared error: 141041281.455852
root mean squared error: 11876.080222693514
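
For reference, the three metrics can be computed from their definitions: MSE is the mean of the squared residuals, RMSE is its square root, and $R^2 = 1 - SS_{res}/SS_{tot}$. The sketch below is a minimal NumPy version; the function name `regression_metrics` and the four toy values are hypothetical.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (R^2, MSE, RMSE) computed from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = float(np.mean(residuals ** 2))
    rmse = float(np.sqrt(mse))
    ss_res = float(np.sum(residuals ** 2))                   # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return r2, mse, rmse

# Toy values for illustration only
r2, mse, rmse = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
print("r2 score:", r2)
print("mean squared error:", mse)
print("root mean squared error:", rmse)
```

The RMSE is expressed in the same units as the target, which makes it easier to compare against typical values of the target than the MSE.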
