In [1]:
%load_ext autoreload
%autoreload 2

# `Logit` on Orders - A warm-up challenge (~1h)

🎯 Let's figure out the impact of `wait_time` and `delay_vs_expected` on very `good/bad reviews`

👉 Using our `orders` training_set, we will run two `multivariate logistic regressions`:
- `logit_one` to predict `dim_is_one_star` 
- `logit_five` to predict `dim_is_five_star`.

 

In [30]:
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

👉 Import your dataset:

In [5]:
from olist.order import Order
orders = Order().get_training_data(with_distance_seller_customer=True)

👉 Select which features you want to use:

⚠️ Make sure you are not creating data leakage (i.e. selecting features that are derived from the target)

In [6]:
orders.columns

Index(['order_id', 'wait_time', 'expected_wait_time', 'delay_vs_expected',
       'order_status', 'dim_is_five_star', 'dim_is_one_star', 'review_score',
       'number_of_products', 'number_of_sellers', 'price', 'freight_value',
       'distance_seller_customer'],
      dtype='object')

In [13]:
# YOUR CODE HERE


🕵🏻 Check the `multicollinearity` of your features, using the `VIF index`.

* It shouldn't be too high (< 10 preferably) to ensure that we can trust the partial regression coefficients and their associated `p-values`
* Do not forget to standardize your data!
    * A `VIF Analysis` is made by regressing each feature against the other features...
    * So you want to `remove the effect of scale` so that your features have an equal importance before running any linear regression!
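
To make the "regress each feature against the others" idea concrete, here is a minimal sketch that computes a VIF by hand as `1 / (1 - R²)`. The helper name `manual_vif` and the toy data are made up for illustration; the sketch adds an intercept to each auxiliary regression, which, as far as I can tell, differs from the `statsmodels` helper (it regresses on exactly the columns you pass), so the two should only be expected to agree on mean-centered data.

```python
import numpy as np

def manual_vif(X):
    """Return one VIF per column of X, computed as 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for i in range(n_cols):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        # Intercept column, so that R^2 is measured around the mean of y
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residuals = y - design @ coef
        r_squared = 1.0 - residuals.var() / y.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Toy data (made up): x2 is strongly tied to x1, x3 is independent noise
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + 0.3 * rng.normal(size=500)
x3 = rng.normal(size=500)
toy_vifs = manual_vif(np.column_stack([x1, x2, x3]))
print(toy_vifs)
```

The two correlated columns get a high VIF while the independent one stays close to 1, which is the pattern to look for in the real features.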
    
    
📚 <a href="https://www.statisticshowto.com/variance-inflation-factor/">Statistics How To - Variance Inflation Factor</a>

📚  <a href="https://online.stat.psu.edu/stat462/node/180/">PennState - Detecting Multicollinearity Using Variance Inflation Factors</a>

⚖️ Standardizing:

In [9]:
# YOUR CODE HERE
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif

orders.corr().style.background_gradient(cmap = 'coolwarm')


Unnamed: 0,wait_time,expected_wait_time,delay_vs_expected,dim_is_five_star,dim_is_one_star,review_score,number_of_products,number_of_sellers,price,freight_value,distance_seller_customer
wait_time,1.0,0.385628,0.702597,-0.234101,0.305577,-0.334036,-0.019754,-0.040702,0.055638,0.167284,0.394982
expected_wait_time,0.385628,1.0,0.005519,-0.050333,0.034842,-0.052525,0.015735,0.024884,0.076606,0.238748,0.513563
delay_vs_expected,0.702597,0.005519,1.0,-0.156735,0.284706,-0.272361,-0.013653,-0.017162,0.016632,0.023887,0.066069
dim_is_five_star,-0.234101,-0.050333,-0.156735,1.0,-0.396354,0.791749,-0.07227,-0.070536,-0.012762,-0.058773,-0.056566
dim_is_one_star,0.305577,0.034842,0.284706,-0.396354,1.0,-0.807758,0.119848,0.102241,0.04466,0.082778,0.043185
review_score,-0.334036,-0.052525,-0.272361,0.791749,-0.807758,1.0,-0.12334,-0.117017,-0.034538,-0.090014,-0.059147
number_of_products,-0.019754,0.015735,-0.013653,-0.07227,0.119848,-0.12334,1.0,0.288734,0.153551,0.438056,-0.017308
number_of_sellers,-0.040702,0.024884,-0.017162,-0.070536,0.102241,-0.117017,0.288734,1.0,0.042986,0.13358,-0.007704
price,0.055638,0.076606,0.016632,-0.012762,0.04466,-0.034538,0.153551,0.042986,1.0,0.410129,0.079356
freight_value,0.167284,0.238748,0.023887,-0.058773,0.082778,-0.090014,0.438056,0.13358,0.410129,1.0,0.314197


👉 Run your VIF Analysis to detect potential multicollinearity:

In [49]:
# YOUR CODE HERE
import scipy.stats as stats
sa_orders = pd.DataFrame()
features = ['wait_time','expected_wait_time','delay_vs_expected','dim_is_five_star','dim_is_one_star','review_score','number_of_products']
sa_orders[['wait_time']]= orders[['wait_time']].apply(stats.zscore)
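# NOTE: this line standardizes `wait_time` a second time, so `expected_time` is an exact copy of `wait_time`
# (hence the infinite VIFs further down); `orders[['expected_wait_time']]` is the column to standardize here
sa_orders[['expected_time']]= orders[['wait_time']].apply(stats.zscore)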
sa_orders[['freight_value']]=orders[['freight_value']].apply(stats.zscore)
sa_orders[['price']]=orders[['price']].apply(stats.zscore)
sa_orders[['distance_seller_customer']]=orders[['distance_seller_customer']].apply(stats.zscore)
sa_orders[['number_of_sellers']]=orders[['number_of_sellers']].apply(stats.zscore)
sa_orders[['dim_is_one_star']]=orders[['dim_is_one_star']]
sa_orders[['dim_is_five_star']]=orders[['dim_is_five_star']]
sa_orders[['delay_vs_expected']]=orders[['delay_vs_expected']].apply(stats.zscore)

sa_orders


Unnamed: 0,wait_time,expected_time,freight_value,price,distance_seller_customer,number_of_sellers,dim_is_one_star,dim_is_five_star,delay_vs_expected
0,-0.431195,-0.431195,-0.652042,-0.513805,-0.979509,-0.112545,0,0,-0.161782
1,0.134174,0.134174,0.000467,-0.086641,0.429765,-0.112545,0,0,-0.161782
2,-0.329909,-0.329909,-0.164054,0.111749,-0.145496,-0.112545,0,1,-0.161782
3,0.073540,0.073540,0.206816,-0.441528,2.054708,-0.112545,0,1,-0.161782
4,-1.019540,-1.019540,-0.652042,-0.562391,-0.959149,-0.112545,0,1,-0.161782
...,...,...,...,...,...,...,...,...,...
95875,-0.454311,-0.454311,-0.449411,-0.311515,-0.893064,-0.112545,0,1,-0.161782
95876,1.023847,1.023847,-0.123156,0.183978,-0.212800,-0.112545,0,0,-0.161782
95877,1.305787,1.305787,1.964500,0.333686,0.617659,-0.112545,0,1,-0.161782
95878,0.483667,0.483667,2.715536,1.075192,-0.387569,-0.112545,0,0,-0.161782
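
The column-by-column z-scoring above can also be written in one pass. This is only a sketch: `standardize` is a hypothetical helper, the toy values are made up, and it uses the population standard deviation (`ddof=0`) to match the default of `scipy.stats.zscore`. Binary targets are left untouched.

```python
import pandas as pd

def standardize(df, columns):
    """Return a copy of df where `columns` are z-scored (population std, ddof=0)."""
    out = df.copy()
    out[columns] = (df[columns] - df[columns].mean()) / df[columns].std(ddof=0)
    return out

# Toy data (made up), with one binary target that must not be rescaled
toy = pd.DataFrame({
    "wait_time": [5.0, 12.0, 8.0, 20.0],
    "price": [30.0, 120.0, 55.0, 80.0],
    "dim_is_five_star": [1, 0, 1, 0],
})
toy_std = standardize(toy, ["wait_time", "price"])
print(toy_std)
```

Passing the list of columns explicitly makes it harder to standardize the wrong source column by accident.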


👉 Fit two `Logistic Regression` models:
- `logit_one` to predict `dim_is_one_star` 
- `logit_five` to predict `dim_is_five_star`.

`Logit 1️⃣`

In [41]:
# YOUR CODE HERE
df = pd.DataFrame()
df['vif_index'] = [vif(sa_orders.values,i)for i in range(sa_orders.shape[1])]
df['features'] = sa_orders.columns
df


  vif = 1. / (1. - r_squared_i)


Unnamed: 0,vif_index,features
0,inf,wait_time
1,inf,expected_time
2,1.364553,freight_value
3,1.206836,price
4,1.420951,distance_seller_customer
5,1.039425,number_of_sellers
6,1.126203,dim_is_one_star
7,1.0298,dim_is_five_star
8,2.225828,delay_vs_expected
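
An infinite VIF is what you would expect when two columns hold exactly the same values: regressing one on the other gives `R² = 1`, so `1 / (1 - R²)` diverges. A minimal sketch for spotting such duplicated columns before running the VIF, assuming a hypothetical helper `duplicated_column_pairs` and made-up toy data:

```python
from itertools import combinations

import numpy as np
import pandas as pd

def duplicated_column_pairs(df):
    """Return the pairs of column names whose values are exactly identical."""
    return [
        (a, b)
        for a, b in combinations(df.columns, 2)
        if np.array_equal(df[a].to_numpy(), df[b].to_numpy())
    ]

# Toy data (made up): `expected_time` is an accidental copy of `wait_time`
toy = pd.DataFrame({
    "wait_time": [-0.43, 0.13, -0.33, 0.07],
    "expected_time": [-0.43, 0.13, -0.33, 0.07],
    "price": [-0.51, -0.09, 0.11, -0.44],
})
pairs = duplicated_column_pairs(toy)
print(pairs)
```

Any pair reported here should be fixed or dropped before interpreting the remaining VIF values.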


`Logit 5️⃣`

In [39]:
# YOUR CODE HERE
logit_one = smf.logit(formula='dim_is_one_star ~ delay_vs_expected + wait_time',data = sa_orders).fit();
logit_one.summary()

Optimization terminated successfully.
         Current function value: 0.283196
         Iterations 7


0,1,2,3
Dep. Variable:,dim_is_one_star,No. Observations:,95872.0
Model:,Logit,Df Residuals:,95869.0
Method:,MLE,Df Model:,2.0
Date:,"Thu, 21 Oct 2021",Pseudo R-squ.:,0.1147
Time:,14:51:15,Log-Likelihood:,-27151.0
converged:,True,LL-Null:,-30669.0
Covariance Type:,nonrobust,LLR p-value:,0.0

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-2.3990,0.012,-193.843,0.000,-2.423,-2.375
delay_vs_expected,0.3367,0.018,19.082,0.000,0.302,0.371
wait_time,0.5589,0.015,38.510,0.000,0.530,0.587


In [50]:
# `wait_time` was listed twice in the formula; listing it once gives the same model
logit_five = smf.logit(formula='dim_is_five_star ~ wait_time + delay_vs_expected + price', data=sa_orders).fit()
logit_five.params

Optimization terminated successfully.
         Current function value: 0.642678
         Iterations 7


Intercept            0.336675
wait_time           -0.440557
delay_vs_expected   -0.492699
price               -0.000837
dtype: float64

💡 It's time to analyse the results of these two logistic regressions:

- Interpret the partial coefficients in your own words.
- Check their statistical significances with `p-values`
- Do you notice any differences between `logit_one` and `logit_five` in terms of coefficient importances?
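
One way to read the partial coefficients is to turn them into odds ratios with `exp(coef)`. Since the features are standardized, each ratio is the multiplicative change in the odds for a one-standard-deviation increase, all else being equal. The sketch below reuses the coefficients printed in the outputs above; the helper name is made up, and the comparison of magnitudes is only a rough guide because `logit_five` also includes `price`.

```python
import math

# Coefficients copied from the logit_one summary and logit_five.params outputs
logit_one_coefs = {"wait_time": 0.5589, "delay_vs_expected": 0.3367}
logit_five_coefs = {"wait_time": -0.440557, "delay_vs_expected": -0.492699}

def odds_ratios(coefs):
    """Map each logit coefficient to its odds ratio exp(coef)."""
    return {name: math.exp(beta) for name, beta in coefs.items()}

one_star_or = odds_ratios(logit_one_coefs)
five_star_or = odds_ratios(logit_five_coefs)

# Features whose coefficient is larger in magnitude for 5-star than for 1-star
stronger_on_five = [
    name for name in logit_one_coefs
    if abs(logit_five_coefs[name]) > abs(logit_one_coefs[name])
]

for name in logit_one_coefs:
    print(f"{name}: 1-star OR = {one_star_or[name]:.2f}, 5-star OR = {five_star_or[name]:.2f}")
print("Stronger on 5-star:", stronger_on_five)
```

For example, a one-standard-deviation longer `wait_time` multiplies the odds of a 1-star review by about 1.75, while odds ratios below 1 for the 5-star model mean the odds of a 5-star review shrink.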

In [56]:
# Among the following sentences, store the ones that are true in the list below

a = "delay_vs_expected influences five_star ratings even more than one_star ratings"
b = "wait_time influences five_star ratings even more than one_star"

your_answer = [a]

🧪 __Test your code__

In [57]:
from nbresult import ChallengeResult

result = ChallengeResult('logit',
    answers = your_answer
)
result.write()
print(result.check())

platform darwin -- Python 3.8.12, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /Users/selmalopez/.pyenv/versions/lewagon_current/bin/python3
cachedir: .pytest_cache
rootdir: /Users/selmalopez/code/selmalopez/data-challenges/04-Decision-Science/04-Logistic-Regression/01-Logit
plugins: dash-2.0.0, anyio-3.3.2
collecting ... collected 1 item

tests/test_logit.py::TestLogit::test_question PASSED                     [100%]



💯 You can commit your code:

git add tests/logit.pickle

git commit -m 'Completed logit step'

git push origin master


<details>
    <summary>- <i>Explanations</i> -</summary>


> _All other things being equal, the `delay factor` tends to increase the chances of losing the 5-star even more than it affects the chances of a 1-star review. Probably because 1-star reviews really target bad products themselves, not bad deliveries._
    
</details>


👉 Compare:
- the regression coefficients obtained from the `Logistic Regression `
- with the regression coefficients obtained through the `Linear Regression` 
- on `review_score`, using the same features. 

⚠️ Make sure both sets of coefficients tell "the same story".
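
As a rough illustration of what "the same story" means, the sketch below fits both models on synthetic data and checks that the coefficient signs agree. Everything here is an assumption made for the example: the data is randomly generated (not the `orders` dataset), the effect sizes are invented, and only two features are used.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (made up): longer waits and delays lower the review score
rng = np.random.default_rng(42)
n = 2000
toy = pd.DataFrame({
    "wait_time": rng.normal(size=n),
    "delay_vs_expected": rng.normal(size=n),
})
latent = 4.0 - 0.8 * toy["wait_time"] - 0.4 * toy["delay_vs_expected"] + rng.normal(size=n)
toy["review_score"] = latent.round().clip(1, 5)
toy["dim_is_five_star"] = (toy["review_score"] == 5).astype(int)

# Same features on the right-hand side of both models
features = "wait_time + delay_vs_expected"
linear = smf.ols(f"review_score ~ {features}", data=toy).fit()
logit_five_toy = smf.logit(f"dim_is_five_star ~ {features}", data=toy).fit(disp=0)

comparison = pd.DataFrame({
    "linear": linear.params,
    "logit_five": logit_five_toy.params,
}).drop("Intercept")
same_sign = np.sign(comparison["linear"]) == np.sign(comparison["logit_five"])
print(comparison.assign(same_sign=same_sign))
```

The magnitudes are not directly comparable (one model predicts a score, the other log-odds), so the check is on signs and on the ranking of features within each model.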

**`Linear Regression`** of the Review score w.r.t. selected features :

1️⃣ Fit the Linear Regression:

In [26]:
# YOUR CODE HERE


2️⃣ Print its summary:

In [30]:
# YOUR CODE HERE

3️⃣ Print the summary of the `logit_five` 

In [31]:
# YOUR CODE HERE

4️⃣ Compare `logit_five` and `linear_regression` regression coefficients.

<details>
    <summary>- <i>Hints</i> -</summary>


* Plot a sorted horizontal bar chart of the regression coefficients for each model
* Plot them side-by-side !
    
</details>
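
A minimal sketch of the side-by-side plot suggested in the hints. The function name is hypothetical; the logit values are copied from the `logit_five.params` output earlier in this notebook, while the linear regression values are placeholders invented for the example and should be replaced by your own fitted parameters.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_coefficients_side_by_side(linear_coefs, logit_coefs):
    """Draw one sorted horizontal bar chart per model, next to each other."""
    fig, axes = plt.subplots(1, 2, figsize=(10, 3))
    titles = ("Linear regression (review_score)", "Logit (dim_is_five_star)")
    for ax, coefs, title in zip(axes, (linear_coefs, logit_coefs), titles):
        ordered = coefs.sort_values()
        ax.barh(ordered.index, ordered.values)
        ax.axvline(0, color="black", linewidth=0.8)  # reference line at zero
        ax.set_title(title)
    fig.tight_layout()
    return fig, axes

# Values from the logit_five.params output (intercept left out)
logit_coefs = pd.Series({
    "wait_time": -0.440557,
    "delay_vs_expected": -0.492699,
    "price": -0.000837,
})
# Hypothetical placeholder values, NOT results from this notebook
linear_coefs = pd.Series({
    "wait_time": -0.45,
    "delay_vs_expected": -0.20,
    "price": -0.01,
})
fig, axes = plot_coefficients_side_by_side(linear_coefs, logit_coefs)
```

With your own models, you could pass `linear_model.params.drop("Intercept")` and `logit_five.params.drop("Intercept")`, where `linear_model` stands for whatever name you gave the fitted linear regression.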


In [28]:
# YOUR CODE HERE

<details>
    <summary><i> - Explanations -</i></summary>


* A side-by-side comparison of the linear regression on `review_score` and the logistic regression on `dim_is_five_star` clearly shows that: <br/>
    The most important feature when it comes to `review_score` and `dim_is_five_star` is the same: `wait_time` (surprised? probably not, but at least this is confirmed statistically!)
    
</details>

🏁 Congratulations! 

💾 Don't forget to commit and push your `logit.ipynb` notebook !