# Exploratory Analysis

In [1]:
%load_ext autoreload
%autoreload 2

In [25]:
import pandas as pd

In [26]:
from olist.data import Olist
data = Olist().get_data()

### 1 - Run an exploratory analysis with [pandas profiling](https://github.com/pandas-profiling/pandas-profiling)

In [27]:
# First, let's install the pandas-profiling package
! pip install pandas-profiling

You should consider upgrading via the '/Users/selmalopez/.pyenv/versions/3.8.12/envs/lewagon_current/bin/python3.8 -m pip install --upgrade pip' command.[0m


In [28]:
# And create a new "04-Decision-Science/reports" folder 
!mkdir -p ../../data/reports

In [29]:
import pandas_profiling
datasets_to_profile = ['orders', 'products', 'sellers',
                  'customers', 'order_reviews',
                  'order_items']

👉 Create and save one `html report` per dataset to profile 

⏳ (It usually takes a few minutes)

In [30]:
data["orders"]

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00
1,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,2018-07-24 20:41:37,2018-07-26 03:24:27,2018-07-26 14:31:00,2018-08-07 15:27:45,2018-08-13 00:00:00
2,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered,2018-08-08 08:38:49,2018-08-08 08:55:23,2018-08-08 13:50:00,2018-08-17 18:06:29,2018-09-04 00:00:00
3,949d5b44dbf5de918fe9c16f97b45f8a,f88197465ea7920adcdbec7375364d82,delivered,2017-11-18 19:28:06,2017-11-18 19:45:59,2017-11-22 13:39:59,2017-12-02 00:28:42,2017-12-15 00:00:00
4,ad21c59c0840e6cb83a9ceb5573f8159,8ab97904e6daea8866dbdbc4fb7aad2c,delivered,2018-02-13 21:18:39,2018-02-13 22:20:29,2018-02-14 19:46:34,2018-02-16 18:17:02,2018-02-26 00:00:00
...,...,...,...,...,...,...,...,...
99436,9c5dedf39a927c1b2549525ed64a053c,39bd1228ee8140590ac3aca26f2dfe00,delivered,2017-03-09 09:54:05,2017-03-09 09:54:05,2017-03-10 11:18:03,2017-03-17 15:08:01,2017-03-28 00:00:00
99437,63943bddc261676b46f01ca7ac2f7bd8,1fca14ff2861355f6e5f14306ff977a7,delivered,2018-02-06 12:58:58,2018-02-06 13:10:37,2018-02-07 23:22:42,2018-02-28 17:37:56,2018-03-02 00:00:00
99438,83c1379a015df1e13d02aae0204711ab,1aa71eb042121263aafbe80c1b562c9c,delivered,2017-08-27 14:46:43,2017-08-27 15:04:16,2017-08-28 20:52:26,2017-09-21 11:24:17,2017-09-27 00:00:00
99439,11c177c8e97725db2631073c19f07b62,b331b74b18dc79bcdf6532d51e1637c1,delivered,2018-01-08 21:28:27,2018-01-08 21:36:21,2018-01-12 15:35:03,2018-01-25 23:32:54,2018-02-15 00:00:00


In [31]:
# YOUR CODE HERE
for key in datasets_to_profile:
    profile = pandas_profiling.ProfileReport(data[key],title=key)
    profile.to_file(f"/home/lewagon/code/SelmaLopez/data-challenges/04-Decision-Science/data/reports/{key}.html")

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]

Export report to file:   0%|          | 0/1 [00:00<?, ?it/s]

FileNotFoundError: [Errno 2] No such file or directory: '/home/lewagon/code/SelmaLopez/data-challenges/04-Decision-Science/data/reports/orders.html'

Take 10 min to read the reports 📈📊, and feel free to add insights of your choice to your `db.lewagon.org schema`

### 2 - Create a matching table

Looking at our schema, it would be nice to create a central `matching_table` that will join the most important foreign keys together, for later use.

❓Create the `matching_table`, a DataFrame with the following columns (below).  
Use outer joins to make sure you don't lose any information at this stage.

In [None]:
columns_matching_table = [
    "order_id",
    "review_id",
    "customer_id",
    "product_id",
    "seller_id",
]

👉 To create this `matching_table`, select carefully the columns of interest in some Olist datasets:

In [None]:
# YOUR CODE HERE
data["orders"][["order_id"]]
data["customers"][["customer_id"]]
data["order_reviews"][["review_id"]]
data["sellers"][["seller_id"]]
data["products"][['product_id']] 

👀 Inspect the cardinality of each DataFrame using `pd.DataFrame.shape` and `pd.Series.nunique()`

In [None]:
# YOUR CODE HERE
data["orders"].shape
data["order_items"].shape
data["customers"].nunique

🧨 Merge these Dataframes carefully to create the `matching_table`:

In [39]:
# YOUR CODE HERE
from olist.data import Olist
data = Olist().get_data()
columns_matching_table = ["order_id", "review_id", "customer_id","product_id", "seller_id",]
a = pd.merge(data['orders'], data['order_reviews'], how = 'outer', on = 'order_id')
b = pd.merge(a, data['customers'], how = 'outer', on = 'customer_id')
c = pd.merge(b, data['order_items'], how = 'outer', on = 'order_id')
d = pd.merge(c, data['products'], how = 'outer', on = 'product_id')
e = pd.merge(c, data['sellers'], how = 'outer', on = 'seller_id')
e[columns_matching_table].drop_duplicates()

Unnamed: 0,order_id,review_id,customer_id,product_id,seller_id
0,e481f51cbdc54678b7cc49136f2d6af7,a54f0611adc9ed256b57ede6b6eb5114,9ef432eb6251297304e76186b10a928d,87285b34884572647811a353c7ac498a,3504c0cb71d7fa48d967e0e4c94d59d9
1,8736140c61ea584cb4250074756d8f3b,b8238c6515192f8129081e17dc57d169,ab8844663ae049fda8baf15fc928f47f,b00a32a0b42fd65efb58a5822009f629,3504c0cb71d7fa48d967e0e4c94d59d9
2,a0151737f2f0c6c0a5fd69d45f66ceea,fa5bf792d42ed25f80c54d18aeaa83de,fc2697314ab7fbeda62bb6f1afa4efcd,725cbfcaff95a4d43742fdf13cf43c75,3504c0cb71d7fa48d967e0e4c94d59d9
3,a3bf941183211246f0d42ad757cba127,5fe3f65882f521f0fbc41e59b9fbff24,3718e1873d5dc3e8d96c0ab783278b02,725cbfcaff95a4d43742fdf13cf43c75,3504c0cb71d7fa48d967e0e4c94d59d9
4,1462290799412b71be32dd880eaf4e1b,36f6f70dc5e67e8e43b109cae0e92d40,220e4b027f0294fd79d2869ef67e7db6,d7faab3fa0091d1220a8ada9cae1bab3,3504c0cb71d7fa48d967e0e4c94d59d9
...,...,...,...,...,...
114087,1ab38815794efa43d269d62b98dae815,7f9849fcbfdf9fa3070c05b5501bf066,a0b67404d84a70ef420a7f99ad6b190a,31ec3a565e06de4bdf9d2a511b822b4d,babcc0ab201e4c60188427cae51a5b8b
114088,b159d0ce7cd881052da94fa165617b05,c950324a42c5796d06f569f77d8b2e88,e0c3bc5ce0836b975d6b2a8ce7bb0e3e,241a1ffc9cf969b27de6e72301020268,8501d82f68d23148b6d78bb7c4a42037
114089,735dce2d574afe8eb87e80a3d6229c48,19f21ead7ffe5b1b5147a7877c22bae5,d531d01affc2c55769f6b9ed410d8d3c,1d187e8e7a30417fda31e85679d96f0f,d263fa444c1504a75cbca5cc465f592a
114090,25d2bfa43663a23586afd12f15b542e7,ec2817e750153dfdd61894780dfc5d9e,9d8c06734fde9823ace11a4b5929b5a7,6e1c2008dea1929b9b6c27fa01381e90,edf3fabebcc20f7463cc9c53da932ea8


❓Does this `matching_table` have duplicated rows ?  How many duplicates ?

* If so, what could be the reason(s) ?
    * Delete the duplicated rows.

<details>
    <summary>- <i>Hints</i></summary> 
    
* For a given `order_id`, the quantity of a given `product_id` bought can be greater than 1
* In the `items` table, each individual product bought appears as an additional row. 

</details>

<details>
    <summary>- <i>Technical hints</i></summary> 
    
You have two options:    
    
* `.duplicated()`
* `drop_duplicates()`

</details>

In [42]:
# YOUR CODE HERE


👉Inspect the shape and the number of uniques values in  of the final DataFrame - *Hint*: use `nunique()`

🎯 It should match (103805, 5)

In [43]:
# YOUR CODE HERE
e.shape

(114092, 27)

### 3 - Save your logic in `data.py` 

❓Copy your logic into `get_matching_table()` in `data.py` and run the cell below to check your code!

In [44]:
from nbresult import ChallengeResult
from olist.data import Olist

data = Olist().get_matching_table()

result = ChallengeResult('matching_table',
    shape=data.shape,
    columns=sorted(list(data.columns)) 
)
result.write()
print(result.check())

AttributeError: 'NoneType' object has no attribute 'shape'