# Lab 2.10.3: Linear Regression: Ratings
A popular restaurant review website has released the attached dataset (ratings.xlsx). Each row gives one restaurant's average customer ratings across several aspects. The dataset records the following attributes for each restaurant: `ambience, food, service, and overall rating`. The first three attributes are predictor variables and the last one is the outcome.<p>

## Instructions
Use a linear regression model to predict how the predictor attributes impact the overall rating of the restaurant.<p>

First, express the linear regression in mathematical form.<p>

Then, try solving it by hand as we did in the live session. Here, you will have four parameters (the constant, and the three attributes/predictors), with one outcome. You do not have to actually solve this with all possible values for these parameters. Rather, show a couple of possible sets of values for the parameters with the outcome value calculated.<p>

Finally, use your favorite programming tool to find the linear regression model and report it in appropriate terms (do not just dump the output from Python).<p>

In [4]:
import pandas as pd
df = pd.read_excel('LAB 2.10.3_ ratings.xlsx')
print(df.columns)
display(df)

Index(['restaurant', 'food', 'ambience', 'service', 'rating'], dtype='object')


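| | restaurant | food | ambience | service | rating |
|---|---|---|---|---|---|
| 0 | 1 | 85 | 82 | 89 | 78 |
| 1 | 2 | 80 | 90 | 80 | 85 |
| 2 | 3 | 83 | 86 | 83 | 85 |
| 3 | 4 | 70 | 96 | 75 | 72 |
| 4 | 5 | 68 | 80 | 78 | 75 |
| 5 | 6 | 65 | 70 | 56 | 54 |
| 6 | 7 | 64 | 68 | 61 | 62 |
| 7 | 8 | 72 | 95 | 72 | 73 |
| 8 | 9 | 69 | 70 | 78 | 70 |
| 9 | 10 | 75 | 80 | 75 | 77 |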


The basic equation representing this relationship is:<p>
    $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon$<p>
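
To try the equation by hand, one can plug candidate parameter values into it and compute the outcome. The sketch below does this for the first restaurant in the table (food 85, ambience 82, service 89, actual rating 78). The two parameter sets and the helper name `predict_rating` are illustrative assumptions chosen by hand, not fitted values.

```python
def predict_rating(beta, food, ambience, service):
    """Evaluate y = b0 + b1*food + b2*ambience + b3*service (error term omitted)."""
    b0, b1, b2, b3 = beta
    return b0 + b1 * food + b2 * ambience + b3 * service

# First restaurant from the dataset: food=85, ambience=82, service=89, rating=78
food, ambience, service, actual = 85, 82, 89, 78

# Two hypothetical parameter sets (beta0, beta1, beta2, beta3), picked by hand
candidates = {
    "equal weights": (0.0, 1 / 3, 1 / 3, 1 / 3),
    "food-heavy": (-10.0, 0.5, 0.2, 0.4),
}

for name, beta in candidates.items():
    y_hat = predict_rating(beta, food, ambience, service)
    print(f"{name}: predicted {y_hat:.2f}, actual {actual}, difference {y_hat - actual:.2f}")
```

Neither set is optimal; least squares looks for the parameter values that make the squared differences as small as possible across all restaurants.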

We can estimate the parameters with Ordinary Least Squares:<p>
$\bar{x}_1 = \frac{\sum_{i=1}^{n} x_{1i}}{n} \quad (\text{mean of food})$<p>
$\bar{x}_2 = \frac{\sum_{i=1}^{n} x_{2i}}{n} \quad (\text{mean of ambience})$<p>
$\bar{x}_3 = \frac{\sum_{i=1}^{n} x_{3i}}{n} \quad (\text{mean of service})$<p>
$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} \quad (\text{mean of rating})$<p>

$\beta_1 = \frac{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)(y_i - \bar{y})}{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)^2} \quad (\text{slope of food})$<p>
$\beta_2 = \frac{\sum_{i=1}^{n} (x_{2i} - \bar{x}_2)(y_i - \bar{y})}{\sum_{i=1}^{n} (x_{2i} - \bar{x}_2)^2} \quad (\text{slope of ambience})$<p>
$\beta_3 = \frac{\sum_{i=1}^{n} (x_{3i} - \bar{x}_3)(y_i - \bar{y})}{\sum_{i=1}^{n} (x_{3i} - \bar{x}_3)^2} \quad (\text{slope of service})$<p>
$\beta_0 = \bar{y} - \beta_1 \bar{x}_1 - \beta_2 \bar{x}_2 - \beta_3 \bar{x}_3 \quad (\text{y-intercept})$<p>
$\epsilon \quad \text{is the error}$<p>

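However, the per-predictor slope formulas are exact only when there is a single predictor (or when the predictors are uncorrelated with one another). With several correlated predictors, the coefficients must be estimated jointly with the normal equation:<p>
$\beta = (X^T X)^{-1} X^T y$<p>
Here $X$ is the design matrix with a leading column of ones (for the intercept) followed by one column per predictor, and $y$ is the vector of ratings.<p>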
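
To see why the joint calculation matters, here is a minimal sketch on synthetic, noise-free data (an assumption for illustration, not the lab's dataset) where the true coefficients are known. It compares the single-predictor slope formula with the normal equation when two predictors are correlated.

```python
import numpy as np

# Synthetic data with known coefficients: y = 1 + 2*x1 + 3*x2
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])  # tends to rise with x1
y = 1 + 2 * x1 + 3 * x2

# Single-predictor slope formula applied to x1 alone
simple_slope_x1 = np.sum((x1 - x1.mean()) * (y - y.mean())) / np.sum((x1 - x1.mean()) ** 2)

# Joint estimate from the normal equation, beta = (X^T X)^-1 X^T y
X = np.column_stack([np.ones_like(x1), x1, x2])
beta = np.linalg.inv(X.T @ X) @ X.T @ y

print(f"single-predictor slope for x1: {simple_slope_x1:.3f}")
print("joint estimate [b0, b1, b2]:", np.round(beta, 3))
```

Because `x2` rises with `x1`, the single-predictor formula attributes part of the effect of `x2` to `x1` and returns a slope of 5 instead of the true 2, while the normal equation recovers 1, 2, and 3.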
    
## Manual Calculation

In [12]:
import numpy as np
# Prepare the data matrices
X = df[['food', 'ambience', 'service']].values
y = df['rating'].values
n = len(y)

# Add a column of ones to X for the intercept term
X = np.concatenate((np.ones((n, 1)), X), axis=1)

# Calculate the coefficients using the Normal Equation
# beta = (X^T X)^-1 X^T y
XT = X.T
XTX = XT @ X
XTX_inv = np.linalg.inv(XTX)
beta = XTX_inv @ XT @ y

# The intercept is the first element of beta
y_intercept = beta[0]
# The slopes are the subsequent elements
slope_food = beta[1]
slope_ambience = beta[2]
slope_service = beta[3]

print(f'slope_food (β₁): {slope_food:.4f}')
print(f'slope_ambience (β₂): {slope_ambience:.4f}')
print(f'slope_service (β₃): {slope_service:.4f}')
print(f'y_intercept (β₀): {y_intercept:.4f}')

slope_food (β₁): 0.7338
slope_ambience (β₂): 0.1338
slope_service (β₃): 0.3240
y_intercept (β₀): -15.8713
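
Explicitly inverting $X^T X$ is fine for a small problem like this one, but NumPy also provides `np.linalg.solve` and `np.linalg.lstsq`, which avoid forming the inverse. The following sketch uses randomly generated data (the data, seed, and `true_beta` values are assumptions, not the lab's dataset) to check that the three approaches give the same coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 15

# Hypothetical predictors on a 50-100 scale plus an intercept column
predictors = rng.uniform(50, 100, size=(n, 3))
X = np.column_stack([np.ones(n), predictors])
true_beta = np.array([-15.0, 0.7, 0.1, 0.3])  # made-up values for the demo
y = X @ true_beta + rng.normal(0, 3, size=n)

# 1) Normal equation with an explicit inverse (as in the cell above)
beta_inv = np.linalg.inv(X.T @ X) @ X.T @ y
# 2) Solve the linear system (X^T X) beta = X^T y without inverting
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)
# 3) Least-squares solver working directly on X and y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_inv, beta_solve, atol=1e-6))
print(np.allclose(beta_solve, beta_lstsq, atol=1e-6))
```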


In [13]:
# Test the fitted formula on every row
predicted = []
for i in range(len(df)):
    x1 = df['food'].iloc[i]
    x2 = df['ambience'].iloc[i]
    x3 = df['service'].iloc[i]
    # Use y_hat so the rating vector y from the previous cell is not overwritten
    y_hat = y_intercept + slope_food*x1 + slope_ambience*x2 + slope_service*x3
    predicted.append(y_hat)
    # Error is reported as predicted minus actual
    epsilon = y_hat - df['rating'].iloc[i]
    print(f'With x={x1}, {x2}, {x3} the predicted y is {y_hat:.3f} with an error of {epsilon:.3f}')
df['predicted'] = predicted

With x=85, 82, 89 the predicted y is 86.302 with an error of 8.302
With x=80, 90, 80 the predicted y is 80.788 with an error of -4.212
With x=83, 86, 83 the predicted y is 83.426 with an error of -1.574
With x=70, 96, 75 the predicted y is 72.633 with an error of 0.633
With x=68, 80, 78 the predicted y is 69.997 with an error of -5.003
With x=65, 70, 56 the predicted y is 59.330 with an error of 5.330
With x=64, 68, 61 the predicted y is 59.949 with an error of -2.051
With x=72, 95, 72 the predicted y is 72.994 with an error of -0.006
With x=69, 70, 78 the predicted y is 69.393 with an error of -0.607
With x=75, 80, 75 the predicted y is 74.161 with an error of -2.839
With x=75, 70, 75 the predicted y is 72.824 with an error of -1.176
With x=72, 90, 78 the predicted y is 74.270 with an error of -1.730
With x=81, 72, 78 the predicted y is 78.466 with an error of -1.534
With x=71, 91, 71 the predicted y is 71.402 with an error of 0.402
With x=67, 86, 78 the predicted y is 70.066 with an 
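
The per-row errors can be condensed into summary measures. As a hedged sketch, the helper functions below (hypothetical names `rmse` and `r_squared`) compute the root-mean-square error and $R^2$ from actual and predicted values. The demo uses only the first four rows printed above, so its numbers are illustrative and will not match the full fit.

```python
import numpy as np

def rmse(actual, predicted):
    """Root-mean-square error between actual and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

# First four restaurants: actual ratings and the predictions printed above
actual_demo = [78, 85, 85, 72]
predicted_demo = [86.302, 80.788, 83.426, 72.633]

print(f"RMSE on the demo rows: {rmse(actual_demo, predicted_demo):.3f}")
print(f"R^2 on the demo rows:  {r_squared(actual_demo, predicted_demo):.3f}")
```

In the notebook itself, the same helpers could be applied to the whole dataset with `rmse(df['rating'], df['predicted'])` and `r_squared(df['rating'], df['predicted'])`.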

## Statsmodels OLS

In [9]:
import statsmodels.formula.api as smf

model = smf.ols('rating ~ food + ambience + service', data=df)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                 rating   R-squared:                       0.794
Model:                            OLS   Adj. R-squared:                  0.738
Method:                 Least Squares   F-statistic:                     14.16
Date:                Sat, 12 Apr 2025   Prob (F-statistic):           0.000428
Time:                        21:29:31   Log-Likelihood:                -40.685
No. Observations:                  15   AIC:                             89.37
Df Residuals:                      11   BIC:                             92.20
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -15.8713     14.700     -1.080      0.3
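
To report the model rather than dump the summary, the fitted coefficients can be written out as an equation. The sketch below uses a hypothetical helper, `format_equation`, fed with the coefficients printed in the manual calculation. In the notebook, an equivalent dictionary could presumably be obtained with `results.params.to_dict()`, assuming the parameters come back as a pandas Series as they do for formula-based fits.

```python
def format_equation(params, outcome="rating", intercept_key="Intercept"):
    """Render a dict of coefficients as a readable regression equation."""
    terms = [f"{params[intercept_key]:.4f}"]
    for name, coef in params.items():
        if name == intercept_key:
            continue
        sign = "+" if coef >= 0 else "-"
        terms.append(f"{sign} {abs(coef):.4f}*{name}")
    return f"{outcome} = " + " ".join(terms)

# Coefficients copied from the manual calculation output
params = {"Intercept": -15.8713, "food": 0.7338, "ambience": 0.1338, "service": 0.3240}
print(format_equation(params))
# rating = -15.8713 + 0.7338*food + 0.1338*ambience + 0.3240*service
```

Read this way, each one-point increase in the food score is associated with roughly a 0.73-point higher overall rating when ambience and service are held fixed, with smaller associations for service (about 0.32) and ambience (about 0.13).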

