<img src = "https://i.imgur.com/HRhd2Y0.png">

In [1]:
import pandas as pd
import sqlite3

In [2]:
orders_df = pd.read_csv('../data/raw/olist_orders_dataset.csv')
orders_items_df = pd.read_csv('../data/raw/olist_order_items_dataset.csv')
products_df = pd.read_csv('../data/raw/olist_products_dataset.csv')
customers_df = pd.read_csv('../data/raw/olist_customers_dataset.csv')
reviews_df = pd.read_csv('../data/raw/olist_order_reviews_dataset.csv')

In [30]:
# Connect to the SQLite database and load one table per DataFrame.
# if_exists='replace' drops and recreates a table that already exists.
cnx = sqlite3.connect('olist.db')

orders_df.to_sql(name='orders', con=cnx, if_exists='replace')
orders_items_df.to_sql(name='orders_items', con=cnx, if_exists='replace')
products_df.to_sql(name='products', con=cnx, if_exists='replace')
customers_df.to_sql(name='customers', con=cnx, if_exists='replace')
reviews_df.to_sql(name='reviews', con=cnx, if_exists='replace')

In [31]:
%%capture
%load_ext sql
%sql sqlite:///olist.db

## Question 1: Which category has the best and worst ratings?

The order items table (loaded above as orders_items) is the one that connects orders to products.
<br><br>
To better understand the relationship between orders, items, products and categories, we first have to explore some of these fields.

### Order items exploration
In the description of the dataset on Kaggle we can read the following note:
>"An order might have multiple items."

<br>
Let's validate this statement using the order_item_id column of the orders_items table.

In [41]:
%%sql
WITH orders_with_multiple_items AS (
    SELECT COUNT(DISTINCT order_item_id) AS items_count, order_id
    FROM orders_items
    GROUP BY order_id
    HAVING items_count > 1)
SELECT 
    COUNT(*) AS orders_with_multiple_items_count, 
    ROUND(
        CAST( COUNT(*) AS FLOAT) / ( SELECT COUNT(*) FROM orders) * 100, 3
    ) AS percentage_of_orders,
    MAX(items_count) AS max_items_count
FROM orders_with_multiple_items m;

 * sqlite:///olist.db
Done.


orders_with_multiple_items_count,percentage_of_orders,max_items_count
9803,9.858,21


So the note holds: 9,803 orders (about 9.9% of all orders) contain more than one item, and the largest order contains 21 items.
In the documentation, order_item_id is described as:
>sequential number identifying number of items included in the same order.
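
To make this description concrete, here is a minimal sketch on made-up rows. The ids are hypothetical, and it assumes one row per purchased unit, which the documentation suggests but which has not been verified here. It also counts distinct product_id values, because an order with several items does not necessarily contain several products.

```python
import sqlite3

# Toy rows (hypothetical ids, not taken from the Olist dataset).
# Assumption: each row is one purchased unit, numbered 1..n within its order.
rows = [
    ("order_a", 1, "prod_x"),
    ("order_a", 2, "prod_x"),
    ("order_a", 3, "prod_x"),  # three units of the same product
    ("order_b", 1, "prod_y"),
    ("order_b", 2, "prod_z"),  # two different products
    ("order_c", 1, "prod_x"),
]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE orders_items (order_id TEXT, order_item_id INTEGER, product_id TEXT)"
)
con.executemany("INSERT INTO orders_items VALUES (?, ?, ?)", rows)

# Items per order versus distinct products per order.
item_vs_product_counts = con.execute("""
    SELECT order_id,
           COUNT(DISTINCT order_item_id) AS items_count,
           COUNT(DISTINCT product_id) AS prods_count
    FROM orders_items
    GROUP BY order_id
    ORDER BY order_id
""").fetchall()
con.close()

for row in item_vs_product_counts:
    print(row)
```

In this toy example, order_a has three items but only one distinct product, while order_b has two items and two distinct products.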

### Products exploration

Our business question is about the **category** that has the best/worst ratings.<br>
Looking at the products table, we can see that category is an attribute of a product, so we will have to focus on the product_id column of the orders_items table.

In [42]:
%%sql
WITH orders_with_multiple_prods AS ( 
    SELECT COUNT(DISTINCT product_id) AS prods_count, order_id
    FROM orders_items
    GROUP BY order_id
    HAVING prods_count > 1)
SELECT 
    COUNT(*) AS orders_with_multiple_prods_count,
    ROUND(
       CAST( COUNT(*) AS FLOAT) / ( SELECT COUNT(*) FROM orders) * 100, 3
    ) AS percentage_of_orders,
    MAX(prods_count) AS max_prods_count
FROM orders_with_multiple_prods m;

 * sqlite:///olist.db
Done.


orders_with_multiple_prods_count,percentage_of_orders,max_prods_count
3236,3.254,8


A little more than 3% of orders contain more than one distinct product, and the highest number of distinct products in a single order is 8.
<br><br>
The only information we have about reviews is in the order reviews table (loaded above as reviews), where each review refers to a specific order, not to a product.
<br><br>
Therefore, to find the category with the worst/best ratings, we have to look at the products in each order. <br>
Orders with multiple products complicate the analysis because, for a given order review, we can't easily tell which product contributed to the score or how much.
<br><br>
Since only about 3% of orders have multiple products, we are going to exclude them from the order reviews analysis.
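
The exclusion described above could be sketched as follows. This is an illustration on toy data, not the query used in this notebook: the rows are made up, and the column name product_category_name is an assumption about the products table. Orders with exactly one distinct product are kept, so each review score can be attributed to a single category.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders_items (order_id TEXT, order_item_id INTEGER, product_id TEXT);
    -- product_category_name is an assumed column name.
    CREATE TABLE products (product_id TEXT, product_category_name TEXT);
    CREATE TABLE reviews (order_id TEXT, review_score INTEGER);

    INSERT INTO orders_items VALUES
        ('o1', 1, 'p1'), ('o1', 2, 'p1'),   -- two units of one product: kept
        ('o2', 1, 'p1'), ('o2', 2, 'p2'),   -- two different products: excluded
        ('o3', 1, 'p2'),
        ('o4', 1, 'p1');
    INSERT INTO products VALUES ('p1', 'toys'), ('p2', 'books');
    INSERT INTO reviews VALUES ('o1', 5), ('o2', 1), ('o3', 2), ('o4', 4);
""")

category_ratings = con.execute("""
    WITH single_product_orders AS (
        -- With one distinct product per order, MIN() simply returns that product.
        SELECT order_id, MIN(product_id) AS product_id
        FROM orders_items
        GROUP BY order_id
        HAVING COUNT(DISTINCT product_id) = 1
    )
    SELECT p.product_category_name,
           COUNT(*) AS reviews_count,
           ROUND(AVG(r.review_score), 2) AS avg_score
    FROM single_product_orders s
    JOIN products p ON p.product_id = s.product_id
    JOIN reviews r ON r.order_id = s.order_id
    GROUP BY p.product_category_name
    ORDER BY avg_score DESC
""").fetchall()
con.close()

for row in category_ratings:
    print(row)
```

Order o2 is dropped because its single review cannot be split between 'toys' and 'books'. The other three orders each contribute their score to exactly one category.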

In [9]:
%%sql
SELECT review_score, 
       ROUND( CAST(COUNT(*) AS FLOAT)/(SELECT COUNT(*) FROM reviews)*100, 3) AS percentage
FROM reviews
GROUP BY review_score;

 * sqlite:///olist.db
Done.


review_score,percentage
1,11.513
2,3.176
3,8.243
4,19.292
5,57.776
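
The percentage queries above cast the count to FLOAT before dividing. A minimal sketch of why, using toy numbers (3 out of 8) rather than the dataset's counts: SQLite performs integer division when both operands are integers, so the share would otherwise be truncated to 0.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Left: integer division truncates 3 / 8 to 0 before multiplying by 100.
# Right: casting one operand to FLOAT keeps the fractional part.
int_share, float_share = con.execute(
    "SELECT 3 / 8 * 100, ROUND(CAST(3 AS FLOAT) / 8 * 100, 3)"
).fetchone()
con.close()

print(int_share, float_share)
```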


In [10]:
%%sql
SELECT COUNT(*) FROM reviews;

 * sqlite:///olist.db
Done.


COUNT(*)
99224
