# Exploratory Analysis

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from olist.data import Olist
data = Olist().get_data()
data

{'sellers':                              seller_id  seller_zip_code_prefix  \
 0     3442f8959a84dea7ee197c632cb2df15                   13023   
 1     d1b65fc7debc3361ea86b5f14c68d2e2                   13844   
 2     ce3ad9de960102d0677a81f5d0bb7b2d                   20031   
 3     c0f3eea2e14555b6faeea3dd58c1b1c3                    4195   
 4     51a04a8a6bdcb23deccc82b0b80742cf                   12914   
 ...                                ...                     ...   
 3090  98dddbc4601dd4443ca174359b237166                   87111   
 3091  f8201cab383e484733266d1906e2fdfa                   88137   
 3092  74871d19219c7d518d0090283e03c137                    4650   
 3093  e603cf3fec55f8697c9059638d6c8eb5                   96080   
 3094  9e25199f6ef7e7c347120ff175652c3b                   12051   
 
             seller_city seller_state  
 0              campinas           SP  
 1            mogi guacu           SP  
 2        rio de janeiro           RJ  
 3             sao paul

### 1 - Run an exploratory analysis with [pandas profiling](https://github.com/pandas-profiling/pandas-profiling)

In [3]:
# First, let's install the pandas-profiling package
! pip install pandas-profiling



In [4]:
# And create a new "04-Decision-Science/data/reports" folder to store the reports
!mkdir -p ../../data/reports

In [5]:
import pandas_profiling
datasets_to_profile = ['orders', 'products', 'sellers',
                  'customers', 'order_reviews',
                  'order_items']

👉 Create and save one `html report` per dataset to profile 

⏳ (It usually takes a few minutes)

In [15]:
for d in datasets_to_profile:
    print(f'exporting: {d}')
    # Build the profiling report for this dataset and save it as HTML
    profile = data[d].profile_report(title=f'Report for {d}')
    profile.to_file(output_file=f'../../data/reports/{d}.html')

exporting: orders


Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

KeyboardInterrupt: 

Take 10 min to read the reports 📈📊, and feel free to add insights of your choice to your `db.lewagon.org schema`

### 2 - Create a matching table

Looking at our schema, it would be nice to create a central `matching_table` that will join the most important foreign keys together, for later use.

❓Create the `matching_table`, a DataFrame with the following columns (below).  
Use outer joins to make sure you don't lose any information at this stage.

In [17]:
columns_matching_table = [
    "order_id",
    "review_id",
    "customer_id",
    "product_id",
    "seller_id",
]
columns_matching_table

['order_id', 'review_id', 'customer_id', 'product_id', 'seller_id']
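To see why outer joins matter at this stage, here is a minimal sketch on toy data. The frames and ids below are made up for illustration and are not taken from the Olist dataset. An inner join silently drops an order that has no review, while an outer join keeps it with a missing `review_id`.

```python
import pandas as pd

# Toy data (made up for illustration): order "o3" has no review
orders_toy = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "customer_id": ["c1", "c2", "c3"],
})
reviews_toy = pd.DataFrame({
    "order_id": ["o1", "o2"],
    "review_id": ["r1", "r2"],
})

# Inner join keeps only order_ids present in both frames
inner = orders_toy.merge(reviews_toy, on="order_id", how="inner")

# Outer join keeps every order_id from either frame
outer = orders_toy.merge(reviews_toy, on="order_id", how="outer")

print(len(inner), len(outer))
print(outer["review_id"].isna().sum(), "order(s) without a review")
```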

👉 To create this `matching_table`, select carefully the columns of interest in some Olist datasets:

In [7]:
# Keep only the id columns needed from each DataFrame before merging
orders = data['orders'][['customer_id', 'order_id']]
reviews = data['order_reviews'][['order_id', 'review_id']]
items = data['order_items'][['order_id', 'product_id','seller_id']]
items[1:30]

Unnamed: 0,order_id,product_id,seller_id
1,00018f77f2f0320c557190d7a144bdd3,e5f2d52b802189ee658865ca93d83a8f,dd7ddc04e1b6c2c614352b383efe2d36
2,000229ec398224ef6ca0657da4fc703e,c777355d18b72b67abbeef9df44fd0fd,5b51032eddd242adc84c38acab88f23d
3,00024acbcdf0a6daa1e931b038114c75,7634da152a4610f1595efa32f14722fc,9d7a1d34a5052409006425275ba1c2b4
4,00042b26cf59d7ce69dfabb4e55b4fd9,ac6c3623068f30de03045865e4e10089,df560393f3a51e74553ab94004ba5c87
5,00048cc3ae777c65dbb7d2a0634bc1ea,ef92defde845ab8450f9d70c526ef70f,6426d21aca402a131fc0a5d0960a3c90
6,00054e8431b9d7675808bcb819fb4a32,8d4f2bb7e93e6710a28f34fa83ee7d28,7040e82f899a04d1b434b795a43b4617
7,000576fe39319847cbb9d288c5617fa6,557d850972a7d6f792fd18ae1400d9b6,5996cddab893a4652a15592fb58ab8db
8,0005a1a1728c9d785b8e2b08b904576c,310ae3c140ff94b03219ad0adc3c778f,a416b6a846a11724393025641d4edd5e
9,0005f50442cb953dcd1d21e1fb923495,4535b0e1091c278dfd193e5a1d63b39f,ba143b05f0110f0dc71ad71b4466ce92
10,00061f2a7bc09da83e415a52dc8a4af1,d63c1011f49d98b976c352955b1c4bea,cc419e0650a3c5ba77189a1882b7556a


👀 Inspect the cardinality of each DataFrame using `pd.DataFrame.shape` and `pd.Series.nunique()`

In [8]:
print('orders:', orders.shape, orders.customer_id.nunique(), 'unique customer_ids, and', orders.order_id.nunique(), 'unique order_ids')
print('review: ', reviews.shape, reviews.order_id.nunique(), 'unique order_ids and', reviews.review_id.nunique(), 'unique reviews' )
print('items: ', items.shape, items.order_id.nunique(), 'unique order_ids,', items.product_id.nunique(), 
      'unique product_ids, and', items.seller_id.nunique(), 'unique seller_ids')

orders: (99441, 2) 99441 unique customer_ids, and 99441 unique order_ids
review:  (99224, 2) 98673 unique order_ids and 98410 unique reviews
items:  (112650, 3) 98666 unique order_ids, 32951 unique product_ids, and 3095 unique seller_ids
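The counts above show 99441 unique `order_id`s in `orders` but only 98673 in `reviews` and 98666 in `items`, so some orders have no review or no item. One way to locate such rows is the `indicator=True` option of `merge`, which adds a `_merge` column telling where each row came from. The sketch below uses made-up toy frames, not the real Olist data.

```python
import pandas as pd

# Toy data (made up for illustration): order "o3" has no item,
# and order "o1" contains two different products
orders_toy = pd.DataFrame({"order_id": ["o1", "o2", "o3"]})
items_toy = pd.DataFrame({
    "order_id": ["o1", "o1", "o2"],
    "product_id": ["p1", "p2", "p3"],
})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only"
checked = orders_toy.merge(items_toy, on="order_id", how="outer", indicator=True)
coverage = checked["_merge"].value_counts()
print(coverage)

# Orders that appear in orders_toy but never in items_toy
orders_without_items = (
    checked.loc[checked["_merge"] == "left_only", "order_id"].nunique()
)
print(orders_without_items, "order(s) without any item")
```

The same check can be run on the real `orders`, `reviews`, and `items` frames to see how many rows each outer join fills with missing values.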


🧨 Merge these DataFrames carefully to create the `matching_table`:

In [9]:
# Carefully merge DataFrames
matching_table = orders.merge(reviews, on='order_id', how='outer').merge(items, on='order_id', how='outer')
matching_table

Unnamed: 0,customer_id,order_id,review_id,product_id,seller_id
0,9ef432eb6251297304e76186b10a928d,e481f51cbdc54678b7cc49136f2d6af7,a54f0611adc9ed256b57ede6b6eb5114,87285b34884572647811a353c7ac498a,3504c0cb71d7fa48d967e0e4c94d59d9
1,b0830fb4747a6c6d20dea0b8c802d7ef,53cdb2fc8bc7dce0b6741e2150273451,8d5266042046a06655c8db133d120ba5,595fac2a385ac33a80bd5114aec74eb8,289cdb325fb7e7f891c38608bf9e0962
2,41ce2a54c0b03bf3443c3d931a367089,47770eb9100c2d0c44946d9cf07ec65d,e73b67b67587f7644d5bd1a52deb1b01,aa4383b373c6aca5d8797843e5594415,4869f7a5dfa277a7dca6462dcf3b52b2
3,f88197465ea7920adcdbec7375364d82,949d5b44dbf5de918fe9c16f97b45f8a,359d03e676b3c069f62cadba8dd3f6e8,d0b61bfb1de832b15ba9d266ca96e5b0,66922902710d126a0e7d26b0e3805106
4,8ab97904e6daea8866dbdbc4fb7aad2c,ad21c59c0840e6cb83a9ceb5573f8159,e50934924e227544ba8246aeb3770dd4,65266b2da20d04dbe00c5c2d3bb7859e,2c9e548be18521d1c43cde1c582c6de8
...,...,...,...,...,...
114087,1fca14ff2861355f6e5f14306ff977a7,63943bddc261676b46f01ca7ac2f7bd8,29bb71b2760d0f876dfa178a76bc4734,f1d4ce8c6dd66c47bbaa8c6781c2a923,1f9ab4708f3056ede07124aad39a2554
114088,1aa71eb042121263aafbe80c1b562c9c,83c1379a015df1e13d02aae0204711ab,371579771219f6db2d830d50805977bb,b80910977a37536adeddd63663f916ad,d50d79cb34e38265a8649c383dcffd48
114089,b331b74b18dc79bcdf6532d51e1637c1,11c177c8e97725db2631073c19f07b62,8ab6855b9fe9b812cd03a480a25058a1,d1c427060a0f73f6b889a5c7c61f2ac4,a1043bafd471dff536d0c462352beb48
114090,b331b74b18dc79bcdf6532d51e1637c1,11c177c8e97725db2631073c19f07b62,8ab6855b9fe9b812cd03a480a25058a1,d1c427060a0f73f6b889a5c7c61f2ac4,a1043bafd471dff536d0c462352beb48


❓ Does this `matching_table` have duplicated rows? How many?

* If so, what could be the reason(s)?
* Delete the duplicated rows.

<details>
    <summary>- <i>Hints</i></summary> 
    
* For a given `order_id`, the quantity of a given `product_id` bought can be greater than 1
* In the `items` table, each individual product bought appears as an additional row. 

</details>

<details>
    <summary>- <i>Technical hints</i></summary> 
    
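Two pandas methods help here:

* `.duplicated()` flags every row that repeats an earlier row
* `.drop_duplicates()` returns the DataFrame without those repeated rows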

</details>
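The hints above can be illustrated on a small made-up example (the ids are hypothetical). When the same product is bought twice in one order, it appears as two rows, and once only the id columns are kept those two rows are identical. Both techniques from the technical hints remove the repeat and give the same result.

```python
import pandas as pd

# Toy order items (made up): order "o1" contains the same product twice
items_toy = pd.DataFrame({
    "order_id":   ["o1", "o1", "o2"],
    "product_id": ["p1", "p1", "p2"],
    "seller_id":  ["s1", "s1", "s2"],
})

# Count rows that repeat an earlier row
n_duplicates = items_toy.duplicated().sum()

# Two equivalent ways to remove them
via_mask = items_toy[~items_toy.duplicated()]
via_drop = items_toy.drop_duplicates()

print(n_duplicates, "duplicated row(s)")
print("same result:", via_mask.equals(via_drop))
```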

In [10]:
matching_table.duplicated().sum()

10287

In [11]:
# Option 1 to remove duplicates: keep only the rows not flagged by .duplicated()
matching_table = matching_table[~matching_table.duplicated()]

In [12]:
# Option 2 to remove duplicates
matching_table = matching_table.drop_duplicates()

In [14]:
matching_table.shape

(103805, 5)

👉 Inspect the shape of the final DataFrame and the number of unique values in each column - *Hint*: use `nunique()`

🎯 It should match (103805, 5)

In [18]:
print(f"matching_table shape = {matching_table.shape}")
print("-"*50)
print('unique values: ')
print(matching_table.nunique())

matching_table shape = (103805, 5)
--------------------------------------------------
unique values: 
customer_id    99441
order_id       99441
review_id      98410
product_id     32951
seller_id       3095
dtype: int64


In [13]:
# Double-check that 10287 duplicated rows were dropped (rows before minus rows after)
114092 - 103805

10287

### 3 - Save your logic in `data.py` 

❓Copy your logic into `get_matching_table()` in `data.py` and run the cell below to check your code!
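Before running the check, here is one possible shape for that logic, written as a standalone function so it can be tried outside the package. This is a sketch, not the official solution: `build_matching_table` and the toy `data` dict are hypothetical names, and the body simply chains the column selection, the two outer merges, and `drop_duplicates()` from the steps above. Inside `data.py`, the same body could be adapted to read from `self.get_data()`.

```python
import pandas as pd


def build_matching_table(data: dict) -> pd.DataFrame:
    """Join the main ids into one table (hypothetical helper, sketch only)."""
    # Keep only the id columns needed for the joins
    orders = data["orders"][["customer_id", "order_id"]]
    reviews = data["order_reviews"][["order_id", "review_id"]]
    items = data["order_items"][["order_id", "product_id", "seller_id"]]

    # Outer joins so that no order, review, or item is lost
    matching_table = (
        orders.merge(reviews, on="order_id", how="outer")
              .merge(items, on="order_id", how="outer")
    )

    # Identical rows appear when a product is bought more than once in an order
    return matching_table.drop_duplicates()


# Toy data (made up): "o1" holds the same product twice, "o2" has no review or item
toy = {
    "orders": pd.DataFrame({"customer_id": ["c1", "c2"],
                            "order_id": ["o1", "o2"]}),
    "order_reviews": pd.DataFrame({"order_id": ["o1"],
                                   "review_id": ["r1"]}),
    "order_items": pd.DataFrame({"order_id": ["o1", "o1"],
                                 "product_id": ["p1", "p1"],
                                 "seller_id": ["s1", "s1"],
                                 "price": [10.0, 10.0]}),
}
table = build_matching_table(toy)
print(table.shape)
print(sorted(table.columns))
```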

In [19]:
from nbresult import ChallengeResult
from olist.data import Olist

data = Olist().get_matching_table()

result = ChallengeResult('matching_table',
    shape=data.shape,
    columns=sorted(list(data.columns)) 
)
result.write()
print(result.check())

platform darwin -- Python 3.8.12, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /Users/shu/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/shu/Desktop/Lewagon/code/shiro101010101/data-challenges/04-Decision-Science/01-Project-Setup/03-Exploratory-Analysis/solution_04-Decision-Science_01-Project-Setup_03-Exploratory-Analysis
plugins: dash-2.0.0, anyio-3.3.2
collecting ... collected 2 items

tests/test_matching_table.py::TestMatchingTable::test_columns PASSED     [ 50%]
tests/test_matching_table.py::TestMatchingTable::test_shape PASSED       [100%]



💯 You can commit your code:

[1;32mgit[39m add tests/matching_table.pickle

[32mgit[39m commit -m [33m'Completed matching_table step'[39m

[32mgit[39m push origin master
