In [1]:
%load_ext autoreload
%autoreload 2

# `Logit` on Orders - Logistic Regression (~1h)

## Select features

🎯 Let's figure out the impact of `wait_time` and `delay_vs_expected` on very good and very bad reviews.

👉 Using our `orders` training set, we will run two `multivariate logistic regressions`:
- `logit_one` to predict `dim_is_one_star` 
- `logit_five` to predict `dim_is_five_star`.

 

In [2]:
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

👉 Import your dataset:

In [3]:
from olist.order import Order
orders = Order().get_training_data(with_distance_seller_customer=True)

👉 Select the features you want to use and store them in a list:

⚠️ Make sure you are not creating data leakage (i.e. selecting features that are derived from the target)

💡 To isolate the impact of `wait_time` and `delay_vs_expected`, we need to control for the impact of other features. Include in your list all the features that may be relevant.

In [7]:
features = [
    'wait_time',
    'delay_vs_expected',
    'price',
    'freight_value',
    'distance_seller_customer',
    'number_of_items'
]
# Call the method (the parentheses were missing) to inspect pairwise correlations
orders[features].corr()

üïµüèª Check the `multicollinearity` of your features, using the `VIF index`.

* It shouldn't be too high (< 10 preferably) to ensure that we can trust the partial regression coefficents and their associated `p-values` 
* Do not forget to standardize your data ! 
    * A `VIF Analysis` is made by regressing a feature vs. the other features...
    * So you want to `remove the effect of scale` so that your features have an equal importance before running any linear regression!
    
    
📚 <a href="https://www.statisticshowto.com/variance-inflation-factor/">Statistics How To - Variance Inflation Factor</a>

📚 <a href="https://online.stat.psu.edu/stat462/node/180/">PennState - Detecting Multicollinearity Using Variance Inflation Factors</a>
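To make "regressing a feature vs. the other features" concrete, here is a minimal sketch that computes the VIF by hand as `1 / (1 - R²)` of that auxiliary regression. The helper name `vif_by_hand` and the synthetic columns `x1`, `x2`, `x3` are assumptions made for illustration; they are not part of the Olist dataset or of `statsmodels`.

```python
import numpy as np

def vif_by_hand(X: np.ndarray, j: int) -> float:
    """VIF of column j: regress it on the other columns, then 1 / (1 - R^2)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    # Add an intercept column so the auxiliary regression is a plain OLS fit.
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    r_squared = 1 - residuals.var() / y.var()
    return 1.0 / (1.0 - r_squared)

# Synthetic data (an assumption for illustration, not the Olist orders)
rng = np.random.default_rng(0)
x1 = rng.normal(size=2_000)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=2_000)  # strongly tied to x1
x3 = rng.normal(size=2_000)                        # independent of the others
X = np.column_stack([x1, x2, x3])

vifs = [vif_by_hand(X, j) for j in range(X.shape[1])]
print([round(v, 2) for v in vifs])
```

In this toy setup the two correlated columns should show a high VIF while the independent one stays close to 1, which is the pattern to look for in your own features.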

⚖️ Standardizing:

In [9]:

orders_scaled = orders[features].copy()

for feature in orders_scaled.columns:
    mu = orders_scaled[feature].mean()
    sigma = orders_scaled[feature].std()
    # Vectorized z-score: subtract the mean, divide by the standard deviation
    orders_scaled[feature] = (orders_scaled[feature] - mu) / sigma

orders_scaled.head()

|   | wait_time | delay_vs_expected | price | freight_value | distance_seller_customer | number_of_items |
|---|---|---|---|---|---|---|
| 0 | -0.431192 | -0.161781 | -0.513802 | -0.652038 | -0.979475 | -0.264595 |
| 1 | 0.134174 | -0.161781 | -0.08664 | 0.000467 | 0.429743 | -0.264595 |
| 2 | -0.329907 | -0.161781 | 0.111748 | -0.164053 | -0.145495 | -0.264595 |
| 3 | 0.07354 | -0.161781 | -0.441525 | 0.206815 | 2.054621 | -0.264595 |
| 4 | -1.019535 | -0.161781 | -0.562388 | -0.652038 | -0.959115 | -0.264595 |


👉 Run your VIF analysis to detect potential multicollinearity:

In [12]:
# 1️⃣ Required import
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif
import pandas as pd

# 2️⃣ Compute the VIF of each feature
vif_df = pd.DataFrame()
vif_df["features"] = orders_scaled.columns
vif_df["vif_index"] = [vif(orders_scaled.values, i) for i in range(orders_scaled.shape[1])]

# 3️⃣ Sort and display
round(vif_df.sort_values(by="vif_index", ascending=False), 2)

|   | features | vif_index |
|---|---|---|
| 0 | wait_time | 2.62 |
| 1 | delay_vs_expected | 2.21 |
| 3 | freight_value | 1.67 |
| 4 | distance_seller_customer | 1.44 |
| 5 | number_of_items | 1.28 |
| 2 | price | 1.21 |


## Logistic Regressions

👉 Fit two `Logistic Regression` models:
- `logit_one` to predict `dim_is_one_star` 
- `logit_five` to predict `dim_is_five_star`.

`Logit 1️⃣`

In [17]:
orders_scaled['dim_is_one_star'] = orders['dim_is_one_star']
logit_one = smf.logit(formula='dim_is_one_star ~ wait_time + delay_vs_expected + price + freight_value + distance_seller_customer + number_of_items', data=orders_scaled).fit()
logit_one.params


Optimization terminated successfully.
         Current function value: 0.276012
         Iterations 7


Intercept                  -2.449757
wait_time                   0.686900
delay_vs_expected           0.267251
price                       0.049164
freight_value              -0.018667
distance_seller_customer   -0.171774
number_of_items             0.301781
dtype: float64

`Logit 5️⃣`

In [18]:
orders_scaled['dim_is_five_star'] = orders['dim_is_five_star']
logit_five = smf.logit(formula='dim_is_five_star ~ wait_time + delay_vs_expected + price + freight_value + distance_seller_customer + number_of_items', data=orders_scaled).fit()
logit_five.params

Optimization terminated successfully.
         Current function value: 0.638779
         Iterations 7


Intercept                   0.338796
wait_time                  -0.511422
delay_vs_expected          -0.438229
price                       0.021506
freight_value               0.002551
distance_seller_customer    0.084898
number_of_items            -0.177838
dtype: float64

💡 It's time to analyse the results of these two logistic regressions:

- Interpret the partial coefficients in your own words.
- Check their statistical significance with `p-values`.
- Do you notice any differences between `logit_one` and `logit_five` in terms of coefficient magnitude?
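One common way to put the partial coefficients into words is to convert them to odds ratios with `exp(coef)`. The sketch below does this for the two delivery features, using the coefficients printed by `logit_one.params` and `logit_five.params`; the dictionaries and the `odds_ratios` helper are names introduced here for illustration only.

```python
import math

# Coefficients copied from the `logit_one` and `logit_five` outputs
logit_one_params = {"wait_time": 0.686900, "delay_vs_expected": 0.267251}
logit_five_params = {"wait_time": -0.511422, "delay_vs_expected": -0.438229}

def odds_ratios(params: dict) -> dict:
    """exp(coef): multiplicative change in the odds for a +1 change in the feature."""
    return {name: math.exp(coef) for name, coef in params.items()}

or_one = odds_ratios(logit_one_params)
or_five = odds_ratios(logit_five_params)

for name in logit_one_params:
    print(f"{name}: odds(1 star) x{or_one[name]:.2f}, odds(5 stars) x{or_five[name]:.2f}")
```

Because the features were standardized, "+1" here means one standard deviation, all other features held constant. An odds ratio above 1 raises the odds of the outcome and one below 1 lowers them. For the significance check, the fitted results expose `logit_one.pvalues` and `logit_one.summary()`.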

In [20]:
# Among the following sentences, store the ones that are true in the list below

a = "delay_vs_expected influences five_star ratings even more than one_star ratings"
b = "wait_time influences five_star ratings even more than one_star"

your_answer = [a]

🧪 __Test your code__

In [21]:
from nbresult import ChallengeResult

result = ChallengeResult('logit',
    answers = your_answer
)
result.write()
print(result.check())


platform darwin -- Python 3.12.9, pytest-8.3.4, pluggy-1.5.0 -- /Users/simonhingant/.pyenv/versions/3.12.9/envs/lewagon/bin/python
cachedir: .pytest_cache
rootdir: /Users/simonhingant/code/simsam56/03-Decision-Science/03-Logistic-Regression/data-logit/tests
plugins: anyio-4.8.0, typeguard-4.4.2
collecting ... collected 1 item

test_logit.py::TestLogit::test_question PASSED                           [100%]



💯 You can commit your code:

git add tests/logit.pickle

git commit -m 'Completed logit step'

git push origin master



<details>
    <summary>- <i>Explanations and advanced concepts </i> -</summary>


> _All other things being equal, the `delay factor` tends to reduce the chances of getting a 5-star review even more than it increases the chances of a 1-star review. This is probably because 1-star reviews mostly target bad products themselves, not bad deliveries._
    
❗️ However, to be totally rigorous, we have to be **more careful when comparing coefficients from two different models**, because **they might not be based on similar populations**!
    We have 2 sub-populations here (people who gave 1 star and people who gave 5 stars), and they may exhibit intrinsically different behavior patterns. It may well be that "happy people" (who tend to give 5 stars easily) are less sensitive than "grumpy people" (who shoot 1-star reviews like Lucky Luke) when it comes to "delay" or "price"...

</details>
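A complementary way to see why raw coefficients are hard to compare across the two models: they live on the log-odds scale, and the same log-odds shift moves the predicted probability by a different amount depending on the baseline rate. The sketch below is only an approximation under stated assumptions. It reuses the intercepts and `delay_vs_expected` coefficients printed by the two models and evaluates the slope of the logistic curve at the "average order", where every standardized feature equals 0. The `models` dictionary and `sigmoid` helper are illustrative names.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Intercepts and `delay_vs_expected` coefficients copied from the two model outputs
models = {
    "logit_one": {"intercept": -2.449757, "delay_vs_expected": 0.267251},
    "logit_five": {"intercept": 0.338796, "delay_vs_expected": -0.438229},
}

effects = {}
for name, params in models.items():
    # With standardized features, the "average order" has every feature at 0,
    # so its predicted probability is sigmoid(intercept).
    p_baseline = sigmoid(params["intercept"])
    # Slope of the logistic curve at that point: beta * p * (1 - p)
    effects[name] = params["delay_vs_expected"] * p_baseline * (1 - p_baseline)
    print(f"{name}: baseline p = {p_baseline:.3f}, marginal effect = {effects[name]:+.4f}")
```

The two baselines differ a lot (1-star reviews are rare, 5-star reviews are common), so the gap between the two models on the probability scale is not the same as the gap between their raw coefficients.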


🏁 Congratulations!

💾 Don't forget to commit and push your `logit.ipynb` notebook!