<a href="https://colab.research.google.com/github/themodernturing/Linear-Regression-Explained/blob/main/Linear_Regression_Github.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook introduces linear regression through the story of a Princeton economics professor who built a model to predict the quality and price of Bordeaux wines. Rather than tasting the wines, he used variables such as weather and the age of the vintage. His model outperformed expert opinions in predicting wine prices and quality.

The notebook walks through the linear regression workflow: selecting variables, avoiding overfitting, and measuring predictive performance. Ultimately, it shows how a quantitative approach can be successfully applied to a traditionally qualitative domain.

In [None]:
import pandas as pd

# Load the dataset
wine_data = pd.read_csv('wine.csv')

# Display the first few rows of the dataframe
display(wine_data.head())

,Year,Price,WinterRain,AGST,HarvestRain,Age,FrancePop
0,1952,7.495,600,17.1167,160,31,43183.569
1,1953,8.0393,690,16.7333,80,30,43495.03
2,1955,7.6858,502,17.15,130,28,44217.857
3,1957,6.9845,420,16.1333,110,26,45152.252
4,1958,6.7772,582,16.4167,187,25,45653.805


Now that we have loaded the data, let's perform some basic exploratory data analysis (EDA) to understand the structure and the distribution of the data.

In [None]:
# Basic information about the dataset (info() prints its report and returns None,
# so it is called directly rather than passed to display())
wine_data.info()

# Summary statistics for the dataset
display(wine_data.describe())

# Check for missing values
display(wine_data.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Year         25 non-null     int64  
 1   Price        25 non-null     float64
 2   WinterRain   25 non-null     int64  
 3   AGST         25 non-null     float64
 4   HarvestRain  25 non-null     int64  
 5   Age          25 non-null     int64  
 6   FrancePop    25 non-null     float64
dtypes: float64(3), int64(4)
memory usage: 1.5 KB

,Year,Price,WinterRain,AGST,HarvestRain,Age,FrancePop
count,25.0,25.0,25.0,25.0,25.0,25.0,25.0
mean,1965.8,7.067224,605.28,16.509336,148.56,17.2,49694.43676
std,7.691987,0.650341,132.277965,0.675397,74.419464,7.691987,3665.270243
min,1952.0,6.2049,376.0,14.9833,38.0,5.0,43183.569
25%,1960.0,6.5188,536.0,16.2,89.0,11.0,46583.995
50%,1966.0,7.1211,600.0,16.5333,130.0,17.0,50254.966
75%,1972.0,7.495,697.0,17.0667,187.0,23.0,52894.183
max,1978.0,8.4937,830.0,17.65,292.0,31.0,54602.193


Year           0
Price          0
WinterRain     0
AGST           0
HarvestRain    0
Age            0
FrancePop      0
dtype: int64

Running a **multiple linear regression** using **Price** as the dependent variable

In [None]:
import statsmodels.api as sm

# Define the dependent variable
Y = wine_data['Price']

# Define the independent variables
X = wine_data[['WinterRain', 'AGST', 'HarvestRain', 'Age', 'FrancePop']]

# Add a constant to the independent variables matrix (for the intercept)
X = sm.add_constant(X)

# Fit the regression model
reg_model = sm.OLS(Y, X).fit()

# Print out the statistics
reg_model.summary()

Dep. Variable:,Price,R-squared:,0.829
Model:,OLS,Adj. R-squared:,0.784
Method:,Least Squares,F-statistic:,18.47
Date:,"Tue, 23 Jan 2024",Prob (F-statistic):,1.04e-06
Time:,09:21:21,Log-Likelihood:,-2.1043
No. Observations:,25,AIC:,16.21
Df Residuals:,19,BIC:,23.52
Df Model:,5,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,-0.4504,10.189,-0.044,0.965,-21.776,20.875
WinterRain,0.0010,0.001,1.963,0.064,-6.89e-05,0.002
AGST,0.6012,0.103,5.836,0.000,0.386,0.817
HarvestRain,-0.0040,0.001,-4.523,0.000,-0.006,-0.002
Age,0.0006,0.079,0.007,0.994,-0.165,0.166
FrancePop,-4.953e-05,0.000,-0.297,0.770,-0.000,0.000

Omnibus:,1.769,Durbin-Watson:,2.792
Prob(Omnibus):,0.413,Jarque-Bera (JB):,1.026
Skew:,-0.005,Prob(JB):,0.599
Kurtosis:,2.008,Cond. No.,8.41e+06


## Explanation of Regression Model Results
The output of a regression model typically includes several statistical measures and tests that help in understanding the performance and validity of the model. Here's a breakdown of the key components of regression model results:

R-squared (R2): This is a measure of how well the independent variables explain the variability of the dependent variable. It ranges from 0 to 1, with higher values indicating a better fit.

Adjusted R-squared: Similar to R2, but adjusted for the number of predictors in the model. It's used to compare the explanatory power of regression models that have different numbers of independent variables.
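
As a quick check of how the adjustment works, the sketch below recomputes adjusted R2 from the rounded values in the summary above (R2 = 0.829, 25 observations, 5 predictors) using the standard formula 1 - (1 - R2)(n - 1)/(n - p - 1). The function name is illustrative and not part of statsmodels.

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R-squared for a model that includes an intercept."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Rounded inputs copied from the regression summary above
adj_r2 = adjusted_r_squared(r_squared=0.829, n_obs=25, n_predictors=5)
print(round(adj_r2, 3))  # 0.784, matching "Adj. R-squared" in the summary
```

The penalty grows with the number of predictors, which is why adding a weak variable can lower adjusted R2 even though plain R2 never decreases.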

F-statistic: This is a test statistic for the overall significance of the regression model. It tests whether at least one of the regression coefficients (other than the intercept) is not equal to zero.

Prob (F-statistic): The p-value corresponding to the F-statistic. A low p-value (typically < 0.05) suggests that the overall regression model is statistically significant.
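
For a model with an intercept, the F-statistic can be written in terms of R2 alone. The sketch below applies that identity to the rounded values from the summary above; because R2 is rounded to three decimals, the result only approximates the reported 18.47. The variable names are illustrative.

```python
# Rounded values copied from the regression summary above
r_squared = 0.829
n_obs = 25
n_predictors = 5                          # "Df Model"
df_residual = n_obs - n_predictors - 1    # "Df Residuals" = 19

# Explained variation per predictor over unexplained variation per residual df
f_stat = (r_squared / n_predictors) / ((1 - r_squared) / df_residual)
print(round(f_stat, 2))
```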

Coefficients: These values represent the estimated impact of each independent variable on the dependent variable. For example, a coefficient of 2 for a variable would mean that for each unit increase in that variable, the dependent variable is expected to increase by 2 units, all else being equal.
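
To make the "all else being equal" reading concrete, the sketch below plugs the rounded coefficients from the summary above into the regression equation for the 1952 vintage (first row of the data shown earlier), then raises AGST by one unit while holding everything else fixed. The helper `predict_price` is a hypothetical name used only for this illustration.

```python
# Rounded coefficients copied from the regression summary above
coefficients = {
    "const": -0.4504,
    "WinterRain": 0.0010,
    "AGST": 0.6012,
    "HarvestRain": -0.0040,
    "Age": 0.0006,
    "FrancePop": -4.953e-05,
}

def predict_price(row: dict) -> float:
    """Intercept plus the sum of coefficient * value over the predictors."""
    return coefficients["const"] + sum(
        coefficients[name] * value for name, value in row.items()
    )

# The 1952 vintage, from the first row of the head() output
vintage_1952 = {"WinterRain": 600, "AGST": 17.1167, "HarvestRain": 160,
                "Age": 31, "FrancePop": 43183.569}
warmer = {**vintage_1952, "AGST": vintage_1952["AGST"] + 1}

baseline = predict_price(vintage_1952)
change = predict_price(warmer) - baseline
print(round(baseline, 2))
print(round(change, 4))  # 0.6012, the AGST coefficient
```

The prediction differs from what `reg_model.predict` would return because the coefficients here are rounded.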

Standard Error: This estimates how much a coefficient would vary from sample to sample, that is, the standard deviation of its sampling distribution. A lower standard error indicates a more precise estimate of the coefficient.

t-statistic: This is used to test the null hypothesis that a coefficient is equal to zero (no effect). A larger absolute value of the t-statistic indicates a more significant coefficient.

P>|t|: The p-value corresponding to the t-statistic of each coefficient. A low p-value indicates that the coefficient is statistically significant.

Confidence Interval: This provides a range within which the true coefficient is likely to fall, with a certain level of confidence (usually 95%).
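
The t-statistic and the confidence interval both follow from the coefficient and its standard error. The sketch below reproduces them for the AGST row of the summary above. The critical value 2.093 is the two-sided 95% point of Student's t with 19 degrees of freedom, hard-coded here as an assumption to avoid a SciPy dependency.

```python
coef, std_err = 0.6012, 0.103   # AGST row of the summary above (rounded)
t_crit = 2.093                  # two-sided 95% critical value, Student's t, 19 df

t_stat = coef / std_err
ci_low = coef - t_crit * std_err
ci_high = coef + t_crit * std_err

print(round(t_stat, 2))                     # close to the reported 5.836
print(round(ci_low, 3), round(ci_high, 3))  # 0.386 0.817
```

An interval that excludes zero, as this one does, corresponds to a p-value below 0.05 for that coefficient.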

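Durbin-Watson: A test statistic that detects first-order autocorrelation in the residuals from a regression analysis. It ranges from 0 to 4: values close to 2 suggest no autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation.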
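
The statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. A minimal sketch in plain Python follows. The two residual series are made up purely to show the behaviour at the extremes, and the function is a hand-rolled illustration (statsmodels ships its own `durbin_watson` in `statsmodels.stats.stattools`).

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residual series, chosen only to show the two extremes
alternating = [1, -1, 1, -1, 1, -1]   # sign flips every step
drifting = [1, 2, 3, 4, 5, 6]         # slow drift in one direction

dw_alternating = durbin_watson(alternating)
dw_drifting = durbin_watson(drifting)
print(dw_alternating)  # well above 2
print(dw_drifting)     # well below 2
```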

Omnibus/Prob(Omnibus): A test of the skewness and kurtosis of the residual (error) distribution. A non-significant result (high p-value) means the test found no evidence against normally distributed residuals.

Jarque-Bera (JB)/Prob(JB): Another test of whether the residuals are normally distributed. Similar to the Omnibus test, a high p-value indicates normality.
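
The Jarque-Bera statistic depends only on the sample size, skewness, and kurtosis of the residuals, so it can be recomputed from the rounded values in the summary above. Under normality it is approximately chi-square distributed with 2 degrees of freedom, for which the upper-tail probability has the closed form exp(-x/2). The variable names below are illustrative.

```python
import math

# Rounded values copied from the regression summary above
n_obs, skew, kurtosis = 25, -0.005, 2.008

# Penalizes skewness away from 0 and kurtosis away from 3
jb = n_obs / 6 * (skew ** 2 + (kurtosis - 3) ** 2 / 4)

# Upper-tail probability of a chi-square variable with 2 degrees of freedom
p_value = math.exp(-jb / 2)

print(round(jb, 2), round(p_value, 2))
```

Both numbers land close to the reported 1.026 and 0.599. The small gap comes from the rounded inputs.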

Condition Number: A measure of how sensitive the coefficient estimates are to small changes in the data. A high condition number indicates potential problems with multicollinearity or with predictors measured on very different scales.
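
A rough illustration of the scale effect, assuming NumPy is available: the sketch below builds a small design matrix from the five rows of the `head()` output shown earlier (constant, AGST, FrancePop) and compares its condition number before and after standardizing the two predictors. It uses only five rows and two predictors, so the numbers will not match the summary; the helper `standardize` is a hypothetical name.

```python
import numpy as np

# Five rows copied from the head() output earlier in this notebook
agst = np.array([17.1167, 16.7333, 17.15, 16.1333, 16.4167])
france_pop = np.array([43183.569, 43495.03, 44217.857, 45152.252, 45653.805])
const = np.ones_like(agst)

def standardize(column):
    """Rescale a column to mean 0 and standard deviation 1."""
    return (column - column.mean()) / column.std()

raw = np.column_stack([const, agst, france_pop])
scaled = np.column_stack([const, standardize(agst), standardize(france_pop)])

# np.linalg.cond returns the ratio of the largest to the smallest singular value
cond_raw = np.linalg.cond(raw)
cond_scaled = np.linalg.cond(scaled)
print(cond_raw)
print(cond_scaled)
```

Standardizing removes the part of the condition number that comes from mismatched units. Whatever remains reflects correlation among the predictors themselves.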

Understanding these statistics is crucial for interpreting the regression model and making informed decisions based on its results.

The notes at the bottom of the regression output mention that the condition number is large (8.41e+06). This might indicate strong multicollinearity or other numerical problems.

To address multicollinearity in the regression model, we can take the following steps:

Variance Inflation Factor (VIF): Calculate the VIF for each predictor variable. VIF quantifies the severity of multicollinearity in an ordinary least squares regression analysis. A rule of thumb is that if VIF is greater than 10, multicollinearity is high.

Remove Variables: Based on the VIF, remove variables that are causing multicollinearity.

Rebuild Model: After removing the problematic variables, rebuild the regression model and check the condition number again.
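
The VIF of a predictor is 1 / (1 - R2), where R2 comes from regressing that predictor on all the other predictors. The short sketch below tabulates this relationship to show where the rule of thumb comes from; the function name is illustrative.

```python
def vif_from_r_squared(r_squared: float) -> float:
    """VIF of a predictor whose regression on the other predictors has this R-squared."""
    return 1.0 / (1.0 - r_squared)

for r2 in (0.0, 0.5, 0.9, 0.99):
    print(r2, round(vif_from_r_squared(r2), 1))
```

A VIF of 10 therefore means that 90% of the predictor's variance is already explained by the other predictors, and the VIF grows without bound as that share approaches 100%.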

Let's start by calculating the VIF for each variable in the model.

In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Load the dataset
wine_data = pd.read_csv('wine.csv')

# Selecting the independent variables: every column except Price
# (this also keeps Year, which the first regression did not use)
X = wine_data.drop(columns=['Price'])

# Adding a constant to the model (intercept)
X = sm.add_constant(X)

# Calculating VIF for each feature
vif_data = pd.DataFrame()
vif_data['feature'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))]

# Display the VIF for each predictor variable
vif_data

  return 1 - self.ssr/self.centered_tss
  vif = 1. / (1. - r_squared_i)


,feature,VIF
0,const,0.0
1,Year,inf
2,WinterRain,1.298801
3,AGST,1.274536
4,HarvestRain,1.116584
5,Age,inf
6,FrancePop,98.252693
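
An infinite VIF means the predictor is an exact linear combination of the others. A likely explanation for Year and Age, checked below on the five rows from the `head()` output shown earlier, is that Age is simply a fixed reference year minus Year. The same sketch inverts VIF = 1 / (1 - R2) to recover the R2 implied by FrancePop's reported VIF. This only verifies the rows visible in this notebook, not the full dataset.

```python
# (Year, Age) pairs copied from the head() output earlier in this notebook
year_age = [(1952, 31), (1953, 30), (1955, 28), (1957, 26), (1958, 25)]

# If Age is an exact linear function of Year, Year + Age is the same in every row
sums = {year + age for year, age in year_age}
print(sums)  # {1983}

# R-squared implied by FrancePop's reported VIF
france_pop_vif = 98.252693
implied_r_squared = 1 - 1 / france_pop_vif
print(round(implied_r_squared, 3))  # 0.99
```

If the relationship holds for every row, Year and Age carry the same information, and their VIFs stay infinite for as long as both remain in the predictor set.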


In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd

# `X` is the predictor DataFrame from the previous cell. It already has a
# 'const' column from sm.add_constant, so this assignment just keeps it at 1
X['const'] = 1

# Drop the 'FrancePop' variable
X = X.drop(columns=['FrancePop'])

# Calculate VIF for each predictor variable
vif_data = pd.DataFrame()
vif_data['feature'] = X.columns

# Calculating VIF for each feature
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))]
vif_data

  return 1 - self.ssr/self.centered_tss
  vif = 1. / (1. - r_squared_i)


,feature,VIF
0,const,0.0
1,Year,inf
2,WinterRain,1.241993
3,AGST,1.225811
4,HarvestRain,1.113615
5,Age,inf


In [None]:
from statsmodels.regression.linear_model import OLS

# Assuming `wine_data` is the DataFrame with the response variable 'Price'
# and `X` is the DataFrame with predictor variables without 'FrancePop'

# Fit the OLS model
model = OLS(wine_data['Price'], X).fit()

# Output the summary of the model
model_summary = model.summary()

# Get the R-squared value
r_squared = model.rsquared

# Get the adjusted R-squared (computed on the training data, not on a test set)
adj_r_squared = model.rsquared_adj
adj_r_squared

0.7942794632109139

In [None]:
import pandas as pd
import statsmodels.api as sm

# Load the test data
wine_test = pd.read_csv('wine_test.csv')

# Build the test predictors with the same columns as the training matrix:
# drop the response 'Price' and the excluded 'FrancePop'
X_test = wine_test.drop(columns=['Price', 'FrancePop'])

# Add a constant to the test data for the model prediction; has_constant='add'
# forces the column even if a test column happens to be constant
X_test = sm.add_constant(X_test, has_constant='add')

# Use the existing model to predict the Price using the test data
wine_test['Predicted_Price'] = model.predict(X_test)

# Display the first few rows of the test data with the predicted prices
wine_test.head()


,Year,Price,WinterRain,AGST,HarvestRain,Age,FrancePop,Predicted_Price
0,1979,6.9541,717,16.1667,122,4,54835.832,6.768925
1,1980,6.4979,578,16.0,74,3,55110.236,6.68491
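
One way to summarize predictive performance on held-out data is an out-of-sample R2: one minus the ratio of the model's squared prediction error to the squared error of a naive baseline. The sketch below computes it from the two test rows shown above. Using the mean training price (7.067224, from the `describe()` output earlier) as the baseline is an assumption of this sketch; other conventions use the test-set mean.

```python
# Actual and predicted prices copied from the two test rows shown above
actual = [6.9541, 6.4979]
predicted = [6.768925, 6.68491]

# Baseline prediction: mean training price, from the describe() output earlier
train_mean_price = 7.067224

sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))   # model error
sst = sum((a - train_mean_price) ** 2 for a in actual)       # baseline error
test_r_squared = 1 - sse / sst
print(round(test_r_squared, 2))
```

With only two test observations this number is very noisy, so it is best read as a sanity check rather than a reliable estimate of predictive accuracy.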
