# Simple orders analysis

We are finally ready to start analysing our order dataset!

Our objective is to get an initial understanding of:
- Order properties
- Their associated `review_scores`

In [1]:
# Import modules
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
%load_ext autoreload
%autoreload 2

In [None]:
# import your newly coded order training set
from olist.order import Order
data = Order().get_training_data()

## 1 - Inspect features

❓ Print summary statistics for each column of the order dataset, and in particular for `wait_time`

<details>
    <summary>Hint</summary>
DataFrame.describe()
</details>
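
As a minimal sketch of what the hint produces, the snippet below calls `describe()` on a small synthetic table. The `toy_orders` DataFrame and its values are made up for illustration; in the notebook you would call the same methods on `data`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the orders table (values are invented, not real Olist data)
rng = np.random.default_rng(0)
toy_orders = pd.DataFrame({
    "wait_time": rng.gamma(shape=2.0, scale=6.0, size=1_000),  # days
    "review_score": rng.integers(1, 6, size=1_000),            # 1 to 5 stars
})

# Count, mean, std, min, quartiles and max for every numeric column
summary = toy_orders.describe()
print(summary)

# Same statistics for a single column
wait_time_stats = toy_orders["wait_time"].describe()
print(wait_time_stats)
```

Comparing the `mean` with the `50%` row (the median) is a quick way to spot skewed columns.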

Plot various histograms to get a sense of each variable's distribution.
Also try to compare the distributions across `review_score` values.
<details>
    <summary>Hint</summary>
You may use `sns.FacetGrid()` to easily create a grid of histogram subplots, one per review score
</details>

What do you notice about the variables `distance_seller_customer`, `price` and `freight_value`?

In [36]:
# Your plots here

----
❓ Inspect the correlations between features as thoroughly as possible. Which features seem most correlated with `review_score`?

<details>
    <summary>Hint</summary>
You may try

- `DataFrame.corr()` combined with `sns.heatmap()`
- Various plots, with varying `hue`, etc.
    - scatterplot
    - `sns.pairplot()` for the whole dataframe (takes time!)
</details>
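
A minimal sketch of the first hint, on synthetic data: compute the correlation matrix, then rank the features by the strength of their correlation with `review_score`. The `toy` table and its coefficients are invented for the example.

```python
import numpy as np
import pandas as pd

# Synthetic data (invented): the score drops with wait_time and ignores price
rng = np.random.default_rng(0)
n = 5_000
wait_time = rng.gamma(shape=2.0, scale=6.0, size=n)
price = rng.lognormal(mean=4.0, sigma=1.0, size=n)
raw_score = 5 - 0.1 * wait_time + rng.normal(0, 1, size=n)
toy = pd.DataFrame({
    "wait_time": wait_time,
    "price": price,
    "review_score": np.clip(np.round(raw_score), 1, 5),
})

# Pairwise Pearson correlations between all numeric columns
corr = toy.corr()

# Features ranked by absolute correlation with review_score (sign preserved)
ranking = (
    corr["review_score"]
    .drop("review_score")
    .sort_values(key=np.abs, ascending=False)
)
print(ranking)
```

Passing `corr` to `sns.heatmap(corr, annot=True)` would draw the same matrix as a colored grid.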

In [44]:
# Your plots here

## 2 - Simple regression of `review_score` against delivery duration

It seems that `review_score` is most strongly (and negatively) correlated with `wait_time` (|r| = 33%) and `delay_vs_expected` (|r| = 27%).
Let's investigate these relationships more closely with seaborn.

### 2.1 Plots
❓ In one figure, create 2 subplots that regress `review_score` on `wait_time` and `delay_vs_expected` respectively

Hints:
- Use `sns.regplot()` to plot the regression line
- Reduce your DataFrame to a random subsample of 10,000 rows out of 100,000 for speed (good practice during data exploration)
    - use `DataFrame.sample()` with a fixed `random_state` if you want to keep the same sample at each execution
- Don't hesitate to zoom in on plausible values by setting `xlim` and `ylim` to hide outliers
- Add some `y_jitter` to better visualize the scatterplot density
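
The hints above can be combined roughly as follows. This is a sketch on synthetic data: the `toy` table, the way `delay_vs_expected` is generated and the axis limits are all assumptions chosen for the example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this also runs as a script; not needed in Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data (invented): longer waits and delays lower the score
rng = np.random.default_rng(0)
n = 20_000
wait_time = rng.gamma(shape=2.0, scale=6.0, size=n)
delay = np.clip(wait_time - 20 + rng.normal(0, 3, size=n), 0, None)
raw_score = 5 - 0.05 * wait_time - 0.1 * delay + rng.normal(0, 1, size=n)
toy = pd.DataFrame({
    "wait_time": wait_time,
    "delay_vs_expected": delay,
    "review_score": np.clip(np.round(raw_score), 1, 5),
})

# Fixed random_state: the same subsample is drawn at every execution
sample = toy.sample(n=2_000, random_state=42)

fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)
for ax, feature in zip(axes, ["wait_time", "delay_vs_expected"]):
    # y_jitter spreads the integer scores vertically so point density is visible
    sns.regplot(data=sample, x=feature, y="review_score",
                y_jitter=0.1, scatter_kws={"alpha": 0.2}, ax=ax)

# Zoom in on plausible values to hide outliers
axes[0].set_xlim(0, 60)
axes[1].set_xlim(0, 40)
axes[0].set_ylim(0, 6)
```

Only the scatter points are jittered; the regression line is fitted on the original values.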

In [1]:
# SUB-SAMPLE YOUR DATASET

In [None]:
# YOUR PLOT HERE

### 2.2 Interpretation

❓ Try to estimate visually the slope of each regression line.

In plain English, how would you interpret these coefficients if you had to explain them to someone not fluent in math?

✏️ Your answer below:


- Slope wait_time = ??? : #your interpretation here
- Slope delay = ??? : #your interpretation here
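
To check a visual estimate, the slope can also be computed by ordinary least squares. The sketch below uses `np.polyfit` on synthetic data generated with a known slope of -0.05; the data and the continuous (unrounded) score are assumptions made to keep the example short.

```python
import numpy as np

# Synthetic data (invented) with a true slope of -0.05 stars per day
rng = np.random.default_rng(0)
n = 10_000
wait_time = rng.uniform(0, 40, size=n)
review_score = 4.7 - 0.05 * wait_time + rng.normal(0, 1, size=n)

# Degree-1 polynomial fit: returns (slope, intercept)
slope, intercept = np.polyfit(wait_time, review_score, deg=1)

print(f"slope = {slope:.3f} stars per extra day of waiting")
print(f"10 extra days of waiting go with a change of {10 * slope:.2f} stars")
```

Phrasing the slope over a meaningful span, such as 10 days, is often easier to explain than a per-day figure.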


---- 
**Let's step back**

These slope coefficients have been computed on a limited sample of orders only: 100,000 for the whole dataset, or far fewer if you randomly sub-sampled it.

How certain can we be that these slope coefficients are statistically significant, i.e. that they are not statistical artefacts of the sample dataset?

After all, our Kaggle dataset only covers 2017, and may well have been sampled from a bigger list of orders.
Can we be confident that these coefficients will hold true as new orders are placed?

We need to estimate a **confidence interval** around the estimated value of each slope:
$$\text{slope}_{\text{wait}} = -0.05 \pm \ ?? \ \text{[95\% interval]}$$
$$\text{slope}_{\text{delay}} = -0.1 \pm \ ?? \ \text{[95\% interval]}$$

Fortunately, seaborn already computes this 95% confidence interval for us: it is the **shaded blue cone** around the regression line!
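
seaborn estimates its shaded band with a bootstrap. The sketch below applies the same idea directly to the slope: refit the line on many resamples drawn with replacement and keep the central 95% of the resulting slopes. The data is synthetic and `fit_slope` is a hypothetical helper, so treat this as an illustration of the principle rather than seaborn's exact procedure.

```python
import numpy as np

# Synthetic data (invented) with a true slope of -0.05
rng = np.random.default_rng(0)
n = 1_000
x = rng.uniform(0, 40, size=n)
y = 4.7 - 0.05 * x + rng.normal(0, 1, size=n)

def fit_slope(x, y):
    """Least-squares slope of y regressed on x (hypothetical helper)."""
    return np.polyfit(x, y, deg=1)[0]

# Bootstrap: resample rows with replacement and refit each time
n_boot = 1_000
boot_slopes = np.empty(n_boot)
for i in range(n_boot):
    idx = rng.integers(0, n, size=n)
    boot_slopes[i] = fit_slope(x[idx], y[idx])

# Percentile interval: the central 95% of the bootstrap slopes
low, high = np.percentile(boot_slopes, [2.5, 97.5])
point_estimate = fit_slope(x, y)
print(f"slope = {point_estimate:.4f}, 95% interval = [{low:.4f}, {high:.4f}]")
```

If the interval excludes 0, a flat line is not a plausible description of the sample.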

----
❓ Your turn to plot:
- First, convince yourself that the slope coefficient depends on the sample by sub-sampling your dataset down to very small sizes. Notice how the slope may sometimes even become positive for very small sub-samples, wrongly suggesting the opposite relationship.
- Second, use the full dataset to visualize the 95% confidence interval with seaborn
- Third, change the size of the confidence interval by playing with the `ci` parameter of `sns.regplot()`
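
The first point can also be checked numerically. The sketch below draws many sub-samples of several sizes from a synthetic dataset with a true slope of -0.05, then reports how much the fitted slope varies and how often it comes out positive. The data and the `subsample_slopes` helper are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "full dataset" (invented) with a true slope of -0.05
N = 100_000
wait_time = rng.uniform(0, 40, size=N)
review_score = 4.7 - 0.05 * wait_time + rng.normal(0, 1, size=N)

def subsample_slopes(size, n_repeats=200):
    """Fit the slope on n_repeats random sub-samples of the given size."""
    slopes = np.empty(n_repeats)
    for i in range(n_repeats):
        idx = rng.choice(N, size=size, replace=False)
        slopes[i] = np.polyfit(wait_time[idx], review_score[idx], deg=1)[0]
    return slopes

spread = {}
share_positive = {}
for size in (10, 100, 1_000, 10_000):
    slopes = subsample_slopes(size)
    spread[size] = slopes.std()                 # variability of the estimate
    share_positive[size] = (slopes > 0).mean()  # how often the sign is wrong
    print(f"n={size:>6}: std of slope = {spread[size]:.4f}, "
          f"positive in {share_positive[size]:.1%} of sub-samples")
```

In `sns.regplot()`, the `ci` argument takes the interval size in percent (for example `ci=99`), and `ci=None` disables the band.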

In [None]:
# YOUR PLOT HERE

----
**Conclusion**

- The 95% confidence interval for the slope does not contain the value 0.
- We are 95% confident that slower deliveries are associated with weaker reviews.
- The `p-value` associated with the null hypothesis "`review_score` is not related to delivery duration" would be quite low, and we could safely reject it.

$\implies$ Our results are said to be **statistically significant**. 
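
As a rough numerical counterpart to these conclusions, `scipy.stats.linregress` returns the slope together with its standard error and the p-value for the null hypothesis of a zero slope. The data below is synthetic, and the interval uses a normal approximation (±1.96 standard errors), which is an assumption that is reasonable for large samples.

```python
import numpy as np
from scipy import stats

# Synthetic data (invented) with a true slope of -0.05
rng = np.random.default_rng(0)
n = 5_000
wait_time = rng.uniform(0, 40, size=n)
review_score = 4.7 - 0.05 * wait_time + rng.normal(0, 1, size=n)

result = stats.linregress(wait_time, review_score)

# Approximate 95% interval for the slope (normal approximation)
ci_low = result.slope - 1.96 * result.stderr
ci_high = result.slope + 1.96 * result.stderr

print(f"slope   = {result.slope:.4f}")
print(f"95% CI  = [{ci_low:.4f}, {ci_high:.4f}]")
print(f"p-value = {result.pvalue:.3g}")
```

An interval that excludes 0 and a p-value below 0.05 are two views of the same conclusion.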

However, **correlation does not imply causation**. It may well be that some products that are inherently slow to deliver on average (heavy ones, maybe?) also consistently receive a low `review_score`, however long they take to be delivered. Identifying such **confounding factors** is crucial and cannot be done with a simple univariate regression. We will see tomorrow the power of multivariate linear regression for that matter.
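
To make the confounding scenario concrete, here is a small simulation under invented assumptions: a hypothetical `weight` variable lengthens `wait_time` and lowers `review_score`, while `wait_time` itself has no direct effect on the score. The univariate slope is still clearly negative, and it only vanishes once `weight` is included in the regression.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical confounder: standardized product weight
weight = rng.normal(0, 1, size=n)
# Heavier products take longer to deliver...
wait_time = 12 + 4 * weight + rng.normal(0, 3, size=n)
# ...and get lower scores, with NO wait_time term in the formula
review_score = 4 - 0.5 * weight + rng.normal(0, 1, size=n)

# Univariate regression: wait_time looks harmful
naive_slope = np.polyfit(wait_time, review_score, deg=1)[0]

# Regression on wait_time AND weight (intercept, wait_time, weight)
X = np.column_stack([np.ones(n), wait_time, weight])
coefs, *_ = np.linalg.lstsq(X, review_score, rcond=None)
adjusted_slope = coefs[1]

print(f"slope of wait_time alone:              {naive_slope:.3f}")
print(f"slope of wait_time, weight controlled: {adjusted_slope:.3f}")
```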