In [1]:
%load_ext autoreload
%autoreload 2

# `Logit` on Orders - A warm-up challenge (~1h)

## Select features

🎯 Let's figure out the impact of `wait_time` and `delay_vs_expected` on very `good/bad reviews`

👉 Using our `orders` training_set, we will run two `multivariate logistic regressions`:
- `logit_one` to predict `dim_is_one_star` 
- `logit_five` to predict `dim_is_five_star`.

 

In [62]:
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import numpy as np

In [87]:
from olist.order import Order
orders = Order().get_training_data(with_distance_seller_customer=True)

👉 Select the features to use and store them in a list:

⚠️ Make sure you do not create data leakage (i.e. do not select features that are derived from the target).

💡 To isolate the impact of `wait_time` and `delay_vs_expected`, we need to control for the impact of other features. Include in the list every feature that may be relevant.

In [91]:
x_var = ['wait_time','expected_wait_time','delay_vs_expected','number_of_products','number_of_sellers','price','freight_value','distance_seller_customer']
y_val = ['dim_is_one_star','dim_is_five_star']
orders.head(3)

Unnamed: 0,order_id,wait_time,expected_wait_time,delay_vs_expected,order_status,dim_is_five_star,dim_is_one_star,review_score,number_of_products,number_of_sellers,price,freight_value,distance_seller_customer
0,e481f51cbdc54678b7cc49136f2d6af7,8.0,15.0,0.0,delivered,0,0,4,1,1,29.99,8.72,18.063837
1,53cdb2fc8bc7dce0b6741e2150273451,13.0,19.0,0.0,delivered,0,0,4,1,1,118.7,22.76,856.29258
2,47770eb9100c2d0c44946d9cf07ec65d,9.0,26.0,0.0,delivered,1,0,5,1,1,159.9,19.22,514.130333


🕵🏻 Check the `multicollinearity` of your features, using the `VIF index`.

* It shouldn't be too high (< 10 preferably) to ensure that we can trust the partial regression coefficients and their associated `p-values`
* Do not forget to standardize your data!
    * A `VIF Analysis` is made by regressing a feature vs. the other features...
    * So you want to `remove the effect of scale` so that your features have an equal importance before running any linear regression!
    
    
📚 <a href="https://www.statisticshowto.com/variance-inflation-factor/">Statistics How To - Variance Inflation Factor</a>

📚  <a href="https://online.stat.psu.edu/stat462/node/180/">PennState - Detecting Multicollinearity Using Variance Inflation Factors</a>
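To make the "regress a feature vs. the other features" idea concrete, here is a minimal sketch that computes a VIF by hand as `1 / (1 - R²)`. The helper name `vif_manual` and the synthetic columns `a`, `b`, `c` are assumptions made for illustration; they are not part of the challenge data. As far as I know, `statsmodels`' `variance_inflation_factor` does not add an intercept on its own, which is one more reason to center the data first; the sketch adds one explicitly.

```python
import numpy as np

def vif_manual(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing column j on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    # Add an intercept column so the regression is valid even if X is not centered
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    r_squared = 1.0 - residuals.var() / y.var()
    return 1.0 / (1.0 - r_squared)

# Synthetic data (hypothetical): b is built from a, c is independent of both
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.5 * rng.normal(size=500)
c = rng.normal(size=500)
X = np.column_stack([a, b, c])
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardize each column

vifs = [vif_manual(X, j) for j in range(X.shape[1])]
print(dict(zip(["a", "b", "c"], [round(float(v), 2) for v in vifs])))
```

The correlated pair `a` and `b` should come out with clearly inflated values, while the independent column `c` should stay close to 1.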

⚖️ Standardizing:

In [95]:
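# Standardize each feature (z-score) so that scale does not drive the VIF regressions
base_scaled = orders[x_var].copy()
for feature in base_scaled.columns:
    mu = base_scaled[feature].mean()
    sigma = base_scaled[feature].std()
    # Vectorized equivalent of applying (x - mu) / sigma to every row
    base_scaled[feature] = (base_scaled[feature] - mu) / sigma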
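
As a small-scale illustration of the standardization step, here is a sketch using only the standard library. The three values are the `wait_time` entries shown by `orders.head(3)`, so the resulting z-scores differ from those in `base_scaled`, which rely on the mean and standard deviation of the full dataset. `statistics.stdev` is the sample standard deviation, which matches the default of pandas' `.std()`.

```python
import statistics

# First three wait_time values from orders.head(3) (toy sample, not the full dataset)
wait_time = [8.0, 13.0, 9.0]

mu = statistics.mean(wait_time)
sigma = statistics.stdev(wait_time)  # sample standard deviation (n - 1 in the denominator)

# z-score: subtract the mean, divide by the standard deviation
z_scores = [(x - mu) / sigma for x in wait_time]
print(z_scores)

# After standardization the column has mean 0 and standard deviation 1
print(statistics.mean(z_scores), statistics.stdev(z_scores))
```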

👉 Run your VIF analysis to detect potential multicollinearity:

In [112]:
base_scaled.head(3)

Unnamed: 0,wait_time,expected_wait_time,delay_vs_expected,number_of_products,number_of_sellers,price,freight_value,distance_seller_customer
0,-0.428002,-0.955662,-0.160462,-0.264595,-0.112544,-0.513802,-0.652038,-0.979475
1,0.100519,-0.499255,-0.160462,-0.264595,-0.112544,-0.08664,0.000467,0.429743
2,-0.322297,0.299456,-0.160462,-0.264595,-0.112544,0.111748,-0.164053,-0.145495


In [160]:
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif
# compute the VIF for the feature at index 1 (expected_wait_time)
vif(base_scaled.values, 1)

1.5974166673318788

In [114]:
df = pd.DataFrame()

df["features"] = base_scaled.columns

df["vif_index"] = [vif(base_scaled.values, i) for i in range(base_scaled.shape[1])]

round(df.sort_values(by="vif_index", ascending = False),2)

Unnamed: 0,features,vif_index
0,wait_time,3.04
2,delay_vs_expected,2.43
6,freight_value,1.68
7,distance_seller_customer,1.6
1,expected_wait_time,1.6
3,number_of_products,1.37
5,price,1.21
4,number_of_sellers,1.1


## Logistic Regressions

👉 Fit two `Logistic Regression` models:
- `logit_one` to predict `dim_is_one_star` 
- `logit_five` to predict `dim_is_five_star`.

`Logit 1️⃣`

In [140]:
logit_one = smf.logit(formula='dim_is_one_star ~ wait_time + delay_vs_expected', data=orders).fit()
logit_one.params

Optimization terminated successfully.
         Current function value: 0.283270
         Iterations 7


Intercept           -3.169822
wait_time            0.059533
delay_vs_expected    0.071306
dtype: float64

In [158]:
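# exp(Intercept) is the baseline odds (all features at 0); exp(slope) is an odds ratio per one-unit increase
# exp(coef)/(1+exp(coef)) is a probability only for the Intercept, not for the slopes
prob = ['probability']
exp = ['odds']
for param in logit_one.params:
    exp.append(np.exp(param))
    prob.append(np.exp(param)/(1+np.exp(param)))
print(exp)
print(prob)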

['odds', 0.042011058359046075, 1.0613404864394242, 1.0739097228874708]
['probability', 0.040317286483700934, 0.5148787856355983, 0.5178189344675446]


`Logit 5️⃣`

In [153]:
logit_five = smf.logit(formula='dim_is_five_star ~ wait_time + delay_vs_expected', data=orders).fit()
logit_five.params

Optimization terminated successfully.
         Current function value: 0.642767
         Iterations 7


Intercept            0.978416
wait_time           -0.046713
delay_vs_expected   -0.105488
dtype: float64

In [159]:
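# Only the Intercept entry of prob2 is a true (baseline) probability; exp2 holds odds ratios for the slopes
prob2 = ['probability']
exp2 = ['odds']
for param in logit_five.params:
    exp2.append(np.exp(param))
    prob2.append(np.exp(param)/(1+np.exp(param)))
print(exp2)
print(prob2)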

['odds', 2.6602389984172503, 0.9543610532687616, 0.8998851539512686]
['probability', 0.726793796680376, 0.48832381901621885, 0.47365239529333697]


💡 It's time to analyse the results of these two logistic regressions:

- Interpret the partial coefficients in your own words.
- Check their statistical significances with `p-values`
- Do you notice any differences between `logit_one` and `logit_five` in terms of coefficient importances?
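
One possible way to read the partial coefficients is sketched below, using the values printed by `logit_one.params` and `logit_five.params` above. The dictionaries `params_one` and `params_five`, the helper `predict_proba`, and the scenario (10 days of `wait_time`, 0 vs. 5 days of `delay_vs_expected`) are assumptions chosen for illustration. A slope is turned into an odds ratio with `exp`, and a probability is obtained by applying the sigmoid to the whole linear predictor, not to a single slope.

```python
import math

# Coefficients copied from the two fitted models in this notebook
params_one = {"Intercept": -3.169822, "wait_time": 0.059533, "delay_vs_expected": 0.071306}
params_five = {"Intercept": 0.978416, "wait_time": -0.046713, "delay_vs_expected": -0.105488}

def sigmoid(z):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(params, wait_time, delay_vs_expected):
    """Probability predicted by a logit model for one (hypothetical) order."""
    log_odds = (params["Intercept"]
                + params["wait_time"] * wait_time
                + params["delay_vs_expected"] * delay_vs_expected)
    return sigmoid(log_odds)

results = {}
for name, params in [("one_star", params_one), ("five_star", params_five)]:
    # Odds ratio: multiplicative change in the odds for one extra day of delay,
    # holding wait_time fixed
    odds_ratio = math.exp(params["delay_vs_expected"])
    p_on_time = predict_proba(params, wait_time=10, delay_vs_expected=0)
    p_late = predict_proba(params, wait_time=10, delay_vs_expected=5)
    results[name] = (odds_ratio, p_on_time, p_late)
    print(f"{name}: odds ratio per day of delay = {odds_ratio:.3f}, "
          f"P(on time) = {p_on_time:.3f}, P(5 days late) = {p_late:.3f}")
```

Under these assumptions, an odds ratio above 1 means each extra day of delay raises the odds of the outcome, and a ratio below 1 means it lowers them. Statistical significance still has to be checked separately with the `p-values`.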

In [51]:
# Among the following sentences, store the ones that are true in the list below

a = "delay_vs_expected influences five_star ratings even more than one_star ratings"
b = "wait_time influences five_star ratings even more than one_star"

your_answer = [a]

🧪 __Test your code__

In [52]:
from nbresult import ChallengeResult

result = ChallengeResult('logit',
    answers = your_answer
)
result.write()
print(result.check())

platform linux -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /home/ysin/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/ysin/code/yongsin91/data-challenges/04-Decision-Science/04-Logistic-Regression/01-Logit
plugins: anyio-3.4.0
collecting ... collected 1 item

tests/test_logit.py::TestLogit::test_question PASSED                     [100%]



💯 You can commit your code:

git add tests/logit.pickle

git commit -m 'Completed logit step'

git push origin master


<details>
    <summary>- <i>Explanations and advanced concepts </i> -</summary>


> _All other things being equal, the `delay factor` tends to increase the chances of losing the 5-star rating even more than it affects the chances of a 1-star review. Probably because 1-star reviews really target bad products themselves, not bad deliveries._

❗️ However, to be totally rigorous, we have to be **more careful when comparing coefficients from two different models**, because **they might not be based on similar populations**!
    We have 2 sub-populations here (people who gave 1 star and people who gave 5 stars), and they may exhibit intrinsically different behavior patterns. It may well be that "happy people" (who tend to give 5 stars easily) are less sensitive than "grumpy people" (who shoot 1-stars like Lucky Luke) when it comes to "delay" or "price".

</details>


## Logistic vs. Linear ?

👉 Compare:
- the regression coefficients obtained from the `Logistic Regression`
- with the regression coefficients obtained through a `Linear Regression`
- on `review_score`, using the same features.

⚠️ Check that both sets of coefficients tell "the same story".

> YOUR ANSWER HERE

In [155]:
model3 = smf.ols(formula = "review_score ~ wait_time + delay_vs_expected", data = orders).fit()
model3.params

Intercept            4.633688
wait_time           -0.038444
delay_vs_expected   -0.020090
dtype: float64

🏁 Congratulations! 

💾 Don't forget to commit and push your `logit.ipynb` notebook!