# Orders - Multivariate Regression of review_score

Let's recall our simple analysis from yesterday's exercise:

Based on our correlation matrix below, we noticed that `review_score` is mostly correlated with the two features `wait_time` and `delay_vs_expected`. However, these two features are also highly correlated with each other. In this exercise, we will use Excel's `Data Analysis` tool to distinguish the effect of one feature, **holding the other one constant**.

![data_columns](correlations.png)

## 1 - Univariate regression

❓ Use [Linear Regression](https://www.excel-easy.com/examples/regression.html) with `Data Analysis` to quickly build:
 - `model1`: an OLS regression of `review_score` over `wait_time`
 - `model2`: an OLS regression of `review_score` over `delay_vs_expected`

For each, read and interpret the results:
- Make sure you understand these results, and try to make a scatterplot with the regression line
- Read the regression performance metric R-squared, as well as individual regression coefficients, t-values, p-values, and 95% confidence intervals
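To cross-check the Excel output, the slope, intercept and R-squared of a univariate regression can be recomputed by hand. The following is a minimal, dependency-free sketch; `simple_ols` is a hypothetical helper name, and the numbers are made up for illustration (they are not taken from the `orders` dataset).

```python
from statistics import mean

def simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns (intercept, slope, r_squared).
    """
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    syy = sum((yi - y_bar) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # R-squared = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1 - ss_res / syy
    return intercept, slope, r_squared

# Illustrative values only (NOT from the orders dataset)
wait_time = [3, 5, 8, 10, 14, 20, 25, 30]
review_score = [5, 5, 4, 5, 4, 3, 2, 1]

intercept, slope, r2 = simple_ols(wait_time, review_score)
print(f"intercept={intercept:.3f}, slope={slope:.3f}, R-squared={r2:.3f}")
```

The slope reads as "change in `review_score` per additional day of `wait_time`", which is the same interpretation as the coefficient in the Excel regression table.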

![data_columns](graphs.png)

## 2 - Multivariate regression - 2 variables

❓ What is the impact on `review_score` of adding one day of `delay_vs_expected` to the order, **holding `wait_time` constant**? Which of the two features better explains a low `review_score`?

For that purpose, run an OLS model `model3` where both `wait_time` and `delay_vs_expected` are the features (independent variables), and `review_score` is the target (dependent variable).


❓ Our multivariate regression allows us to isolate the impact of one feature while controlling for the effect of the other features. These new coefficients are called **partial regression** coefficients. Can you see how they differ from the **simple regression** coefficients calculated above? What can you say about the relative slopes for `wait_time` and `delay_vs_expected`?
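Why simple and partial coefficients differ can be illustrated on a toy example. The sketch below solves the two-feature least-squares problem directly from the normal equations; the helper names (`centered_sum`, `ols_two_features`) and all data are hypothetical. The target is built, by assumption, as an exact linear function of two correlated features, so the "true" partial effects are known in advance.

```python
from statistics import mean

def centered_sum(a, b):
    """Sum of products of deviations from the mean (n times the covariance)."""
    a_bar, b_bar = mean(a), mean(b)
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))

def ols_two_features(x1, x2, y):
    """Fit y = intercept + b1 * x1 + b2 * x2 by ordinary least squares."""
    s11, s22, s12 = centered_sum(x1, x1), centered_sum(x2, x2), centered_sum(x1, x2)
    s1y, s2y = centered_sum(x1, y), centered_sum(x2, y)
    det = s11 * s22 - s12 ** 2  # equals zero if x1 and x2 are perfectly collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    intercept = mean(y) - b1 * mean(x1) - b2 * mean(x2)
    return intercept, b1, b2

# Made-up, correlated features (NOT from the orders dataset)
wait_time = [5, 8, 10, 12, 15, 20, 25, 30]
delay = [0, 0, 0, 2, 3, 8, 12, 20]
# Assumed "true" relationship: -0.1 per day of wait, -0.02 per day of delay
score = [5 - 0.1 * w - 0.02 * d for w, d in zip(wait_time, delay)]

_, b_wait, b_delay = ols_two_features(wait_time, delay, score)
simple_slope_delay = centered_sum(delay, score) / centered_sum(delay, delay)

print(f"partial slope for delay: {b_delay:.4f}")
print(f"simple slope for delay:  {simple_slope_delay:.4f}")
```

In this toy setup the simple slope for `delay` is more negative than its partial slope, because the univariate regression also absorbs the effect of the correlated `wait_time`.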


<details>
    <summary>💡 Solution</summary>

- Holding `wait_time` constant, each additional day of `delay_vs_expected` reduces the `review_score` by 0.0158 points on average
- Holding `delay_vs_expected` constant, each additional day of `wait_time` reduces the `review_score` by 0.0400 points on average

Contrary to what the simple bivariate correlation analysis suggested, `delay_vs_expected` is actually less impactful than `wait_time` in driving lower `review_score`! This finding demonstrates the importance of multivariate regression for removing the potential impact of confounding factors.

</details>

## 3 - Multivariate regression - 2+ variables


❓ R-squared is quite low: only about 11% (0.1137) of the variance of `review_score` is explained by the combined variations of `wait_time` and `delay_vs_expected`. Let's try adding more features to our regression to improve explainability.

- Create a new OLS `model4` with more features from `orders` dataset.
    - Do not create **data leaks**: do not add features that are directly derived from the `review_score`
    - Do not add two features that are perfectly collinear with each other
    - Transform each feature $X_i$ into its respective z-score $Z_i = \frac{X_i - \mu_i}{\sigma_i}$ in order to compare the partial regression coefficients $\beta_i$ together. Otherwise, the $\beta_i$ are not of the same dimension, meaning you'll be comparing apples (e.g. "review_stars per day") with oranges (e.g. "review_stars per BRL")!
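The z-score transformation above can be sketched in a few lines. This is a minimal illustration on made-up values; `z_scores` is a hypothetical helper, and it uses the population standard deviation (using the sample standard deviation instead would only rescale every coefficient by the same factor). In Excel, the built-in `STANDARDIZE` function combined with `AVERAGE` and `STDEV.P` gives the same result.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Return (x - mu) / sigma for every x, using the population std dev."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Made-up feature values (NOT from the orders dataset)
price = [10.0, 20.0, 30.0, 40.0]
price_z = z_scores(price)

# After standardization, the feature has mean 0 and standard deviation 1
print([round(z, 3) for z in price_z])
print(round(mean(price_z), 6), round(pstdev(price_z), 6))
```

Once every feature is on this unitless scale, each $\beta_i$ reads as "change in `review_score` per one standard deviation of feature $i$", which makes the coefficients directly comparable.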

```python
# Select features
```

```python
# Standardize features (transform them into their respective z-scores)
```

```python
# Create and train the model
```


❓ What are the most important features? (make a bar chart to visualize them well)
- How has the overall regression performance changed?
- Is this regression statistically significant?
- In the Regression dialog, tick the "Line Fit Plots" option to plot the predictions against the real values for each feature

```python
# Your answer
```

<details>
    <summary>💡Explanations</summary>
    

- `delay_vs_expected` is the biggest explanatory variable

- Depending on your choice of feature, you may not be able to conclude anything about `price` and `freight_value` if their p-values are too high
    
- Overall, this multivariate regression remains statistically significant, because its F-statistic is much greater than 1 (at least one feature has a very low p-value)

- R-squared hasn't increased by much. Most of the explainability of `review_score` lies outside of the `orders` dataset.

Low R-squared is common when the number of observations (n) is much higher than the number of features (p). Relevant insights can still be derived from such regressions, provided they are statistically significant.
</details>
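To see how a low R-squared can coexist with a large F-statistic, recall the standard relation for a regression with an intercept, $F = \frac{R^2 / p}{(1 - R^2) / (n - p - 1)}$. The sketch below evaluates it for the R-squared reported earlier (0.1137, two features); the sample size `n = 1000` is a made-up assumption, not the size of the `orders` dataset, and `f_statistic` is a hypothetical helper name.

```python
def f_statistic(r_squared, p, n):
    """Overall F-statistic of an OLS regression with an intercept.

    r_squared: coefficient of determination
    p: number of features (excluding the intercept)
    n: number of observations
    """
    explained = r_squared / p
    unexplained = (1 - r_squared) / (n - p - 1)
    return explained / unexplained

# R-squared from the two-feature model; n is an assumed, illustrative value
f_small_sample = f_statistic(0.1137, p=2, n=50)
f_large_sample = f_statistic(0.1137, p=2, n=1000)
print(f"F with n=50:   {f_small_sample:.2f}")
print(f"F with n=1000: {f_large_sample:.2f}")
```

For a fixed R-squared, F grows roughly in proportion to the number of observations, which is why a model that explains little variance can still be highly significant on a large dataset.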



## 4 - Check model performance

You can also plot the distribution of the predictions and the residuals in order to have a better understanding of your model.
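As a rough illustration of what such a check involves, the sketch below computes predictions and residuals from a fitted line and prints a text histogram of the residuals. The coefficients and data are made up for illustration (they are not the fitted values from the exercise), and `text_histogram` is a hypothetical helper; in Excel, ticking "Residuals" in the Regression dialog produces the corresponding columns.

```python
import math
from collections import Counter

# Hypothetical fitted coefficients and data, for illustration only
intercept, b_wait = 5.0, -0.1
wait_time = [3, 5, 8, 10, 14, 20, 25, 30]
review_score = [5, 5, 4, 5, 4, 3, 2, 1]

# Predictions are continuous, while the real scores are integers from 1 to 5
predictions = [intercept + b_wait * w for w in wait_time]
residuals = [y - y_hat for y, y_hat in zip(review_score, predictions)]

def text_histogram(values, bin_width=0.5):
    """Count values per bin of width `bin_width` and print one row per bin."""
    bins = Counter(math.floor(v / bin_width) * bin_width for v in values)
    for left in sorted(bins):
        print(f"[{left:+.1f}, {left + bin_width:+.1f}) {'#' * bins[left]}")
    return bins

hist = text_histogram(residuals)
```

Comparing the spread of `predictions` with the five possible values of `review_score` makes the mismatch between a continuous model and a discrete target easy to spot.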

<details>
    <summary>💡Explanations</summary>

☝️ Our model is not so great, for two reasons:
- First, because we don't have enough features to explain a significant proportion of the variance in `review_score` (low R-squared)
- Second, because we are trying to fit a linear regression function to a discrete classification problem

</details>