<img src = "https://i.imgur.com/HRhd2Y0.png">

In [1]:
import pandas as pd
import sqlite3

In [2]:
orders_df = pd.read_csv('../data/raw/olist_orders_dataset.csv')
orders_items_df = pd.read_csv('../data/raw/olist_order_items_dataset.csv')
products_df = pd.read_csv('../data/raw/olist_products_dataset.csv')
customers_df = pd.read_csv('../data/raw/olist_customers_dataset.csv')
reviews_df = pd.read_csv('../data/raw/olist_order_reviews_dataset.csv')

In [3]:
# Connect to the SQLite database and write each DataFrame to a table, replacing the table if it already exists
cnx = sqlite3.connect('olist.db')

orders_df.to_sql(name='orders', con=cnx, if_exists='replace')
orders_items_df.to_sql(name='orders_items', con=cnx, if_exists='replace')
products_df.to_sql(name='products', con=cnx, if_exists='replace')
customers_df.to_sql(name='customers', con=cnx, if_exists='replace')
reviews_df.to_sql(name='reviews', con=cnx, if_exists='replace')

In [4]:
%%capture
%load_ext sql
%sql sqlite:///olist.db

## Question 1: Which category has the best and worst ratings?

The order_items dataset (loaded here as the orders_items table) is what connects orders to products.
<br><br>
To better understand the relationship between orders, items, products and categories, we first have to explore some of these fields.

### Order items exploration
In the description of the dataset on Kaggle we can read the following note:
>"An order might have multiple items."

<br>
Let's validate this statement using order_item_id in the orders_items table.

In [5]:
%%sql
WITH orders_with_multiple_items AS ( 
    SELECT COUNT(DISTINCT order_item_id) AS items_count, order_id
    FROM orders_items
    GROUP BY order_id
    HAVING items_count > 1)
SELECT 
    COUNT(*) AS orders_with_multiple_items_count, 
    ROUND(
        CAST( COUNT(*) AS FLOAT) / ( SELECT COUNT(*) FROM orders) * 100, 3
    ) AS percentage_of_orders,
    MAX(items_count) AS max_items_count
FROM orders_with_multiple_items;

 * sqlite:///olist.db
Done.


orders_with_multiple_items_count,percentage_of_orders,max_items_count
9803,9.858,21


So about 9.9% of orders (9,803) contain more than one item, and the largest order has 21 items.
In the documentation, order_item_id is described as:
>sequential number identifying number of items included in the same order.

### Products exploration

Our business question is about the **category** that has the best/worst ratings.<br>
Looking at the products table, we can see that category is an attribute of a product, so we have to focus on the product_id column in the orders_items table.

In [6]:
%%sql
WITH orders_with_multiple_prods AS ( 
    SELECT COUNT(DISTINCT product_id) AS prods_count, order_id
    FROM orders_items
    GROUP BY order_id
    HAVING prods_count > 1)
SELECT 
    COUNT(*) AS orders_with_multiple_prods_count,
    ROUND(
       CAST( COUNT(*) AS FLOAT) / ( SELECT COUNT(*) FROM orders) * 100, 3
    ) AS percentage_of_orders,
    MAX(prods_count) AS max_prods_count
FROM orders_with_multiple_prods m;

 * sqlite:///olist.db
Done.


orders_with_multiple_prods_count,percentage_of_orders,max_prods_count
3236,3.254,8


A little more than 3% of orders contain more than one distinct product, and the highest number of distinct products in a single order is 8.
<br><br>
The only information we have about reviews is in the order_reviews dataset (loaded here as the reviews table), where each review refers to a specific order, not to a product.
<br><br>
Therefore, to find the category with the worst/best ratings, we would have to look at products in each order. <br>
Orders with multiple products complicate the analysis because, for a given order review, we can't easily tell which product contributed to the score or how much.
<br><br>
Since we found out that just about 3% of orders have multiple products, we are going to exclude them from the order reviews analysis.
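The exclusion described above can be sketched as follows. This is a minimal illustration on a made-up in-memory table that mimics the columns of orders_items; the rows and the `single_product_orders` name are assumptions for the example, not part of the Olist data or of this notebook's pipeline.

```python
import sqlite3

# Toy stand-in for the orders_items table (made-up rows, not Olist data).
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE orders_items (order_id TEXT, order_item_id INTEGER, product_id TEXT)"
)
con.executemany(
    "INSERT INTO orders_items VALUES (?, ?, ?)",
    [
        ("o1", 1, "p1"),  # o1: two items, same product -> kept
        ("o1", 2, "p1"),
        ("o2", 1, "p2"),  # o2: two different products -> excluded
        ("o2", 2, "p3"),
        ("o3", 1, "p4"),  # o3: single item -> kept
    ],
)

# Keep only orders whose items all refer to one product.
# MIN(product_id) simply returns that single product for the order.
single_product_orders = con.execute(
    """
    SELECT order_id, MIN(product_id) AS product_id
    FROM orders_items
    GROUP BY order_id
    HAVING COUNT(DISTINCT product_id) = 1
    ORDER BY order_id
    """
).fetchall()

print(single_product_orders)  # [('o1', 'p1'), ('o3', 'p4')]
```

With one product per remaining order, each order review can be attributed to exactly one product, and therefore to one category.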

### Reviews exploration

In [30]:
%%sql
SELECT COUNT(DISTINCT review_id) AS unique_review_id,
    COUNT(DISTINCT order_id) AS unique_order_id
FROM reviews;

 * sqlite:///olist.db
Done.


unique_review_id,unique_order_id
98410,98673


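The numbers of unique review_id and order_id values don't match, so reviews and orders are not in a one-to-one relationship.
We are going to start by looking at orders that have more than one review.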
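One possible way to read such a mismatch is to compare both distinct counts with the total number of rows. The sketch below uses a made-up in-memory reviews table (the rows are hypothetical) in which one order has two reviews and one review is attached to two orders.

```python
import sqlite3

# Toy stand-in for the reviews table (made-up rows, not Olist data).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE reviews (review_id TEXT, order_id TEXT)")
con.executemany(
    "INSERT INTO reviews VALUES (?, ?)",
    [
        ("r1", "o1"),  # o1 has two different reviews
        ("r2", "o1"),
        ("r3", "o2"),  # r3 is attached to two different orders
        ("r3", "o3"),
        ("r4", "o4"),
    ],
)

total_rows, unique_reviews, unique_orders = con.execute(
    """
    SELECT COUNT(*),
           COUNT(DISTINCT review_id),
           COUNT(DISTINCT order_id)
    FROM reviews
    """
).fetchone()

# Rows beyond the distinct counts are repeated review_id / order_id values.
repeated_review_rows = total_rows - unique_reviews
repeated_order_rows = total_rows - unique_orders
print(total_rows, unique_reviews, unique_orders)
```

If both differences are non-zero, duplicates exist in both directions: some orders carry several reviews, and some review_id values appear on several rows.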

### Duplicate order_id exploration 

In [51]:
%%sql
SELECT order_id, COUNT(order_id) 
FROM reviews
GROUP BY order_id
HAVING COUNT(order_id) > 1
ORDER BY COUNT(order_id) DESC
LIMIT 3;

 * sqlite:///olist.db
Done.


order_id,COUNT(order_id)
df56136b8031ecd28e200bb18e6ddb2e,3
c88b1d1b157a9999ce368f218a407141,3
8e17072ec97ce29f0e1f111e598b0c85,3


It looks like there are multiple reviews for the same order.
Let's look at one example with 3 reviews.

In [52]:
%%sql
SELECT * FROM reviews WHERE order_id = '8e17072ec97ce29f0e1f111e598b0c85';

 * sqlite:///olist.db
Done.


index,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp
44694,67c2557eb0bd72e3ece1e03477c9dff5,8e17072ec97ce29f0e1f111e598b0c85,1,,Entregou o produto errado.,2018-04-07 00:00:00,2018-04-08 22:48:27
64510,2d6ac45f859465b5c185274a1c929637,8e17072ec97ce29f0e1f111e598b0c85,1,,Comprei 3 unidades do produto vieram 2 unidades que não corresponde com o que comprei. Devido a minha opinião é negativa com relação a esse vendedor pois não não cumpriu com o prometido na venda.,2018-04-07 00:00:00,2018-04-07 21:13:05
92300,6e4c4086d9611ae4cc0cc65a262751fe,8e17072ec97ce29f0e1f111e598b0c85,1,,"Embora tenha entregue dentro do prazo, não enviou o produto que comprei.",2018-04-14 00:00:00,2018-04-16 11:37:31


These reviews have different comments and different review_answer_timestamp values. <br> <br>

review_answer_timestamp : 
>Shows satisfaction survey answer timestamp.

<br>

Two of these three reviews also share the same review_creation_date.

review_creation_date :
>Shows the date in which the satisfaction survey was sent to the customer.

These two answers were submitted at different times (on different days, in fact), so unless this is a data quality issue, it looks like the user was able to answer the same survey twice.
<br><br>

But how often does an order have more than one review with different review_answer_timestamp values?

In [62]:
%%sql
WITH multiple_reviews_orders AS (
    SELECT order_id, COUNT(order_id) 
    FROM reviews
    GROUP BY order_id
    HAVING COUNT(order_id) > 1 AND COUNT(DISTINCT review_answer_timestamp) > 1
)
SELECT COUNT(*) AS orders_with_multiple_reviews_answers,
       ROUND(
        CAST( COUNT(*) AS FLOAT) / ( SELECT COUNT(DISTINCT order_id) FROM reviews) * 100, 3
       ) AS percentage_of_reviewed_orders
FROM multiple_reviews_orders;

 * sqlite:///olist.db
Done.


orders_with_multiple_reviews_answers,percentage_of_reviewed_orders
547,0.554


And how often does an order have more than one review with different review_creation_date values?

In [63]:
%%sql
WITH multiple_reviews_orders AS (
    SELECT order_id, COUNT(order_id) 
    FROM reviews
    GROUP BY order_id
    HAVING COUNT(order_id) > 1 AND COUNT(DISTINCT review_creation_date) > 1
)
SELECT COUNT(*) AS orders_with_multiple_reviews_surveys,
       ROUND(
        CAST( COUNT(*) AS FLOAT) / ( SELECT COUNT(DISTINCT order_id) FROM reviews) * 100, 3
       ) AS percentage_of_reviewed_orders
FROM multiple_reviews_orders;

 * sqlite:///olist.db
Done.


orders_with_multiple_reviews_surveys,percentage_of_reviewed_orders
392,0.397


Both cases are rare: 0.554% of reviewed orders have reviews with different answer timestamps, and 0.397% have reviews with different survey creation dates.
<br><br>
The **reviews dataset** documentation on Kaggle says the following:
>This dataset includes data about the reviews made by the customers.
After a customer purchases the product from Olist Store a seller gets notified to fulfill that order.<br> Once the customer receives the product, or the estimated delivery date is due, the customer gets a satisfaction survey by email where he can give a note for the purchase experience and write down some comments.

<br>
From this information it's not clear how a user could receive more than one survey for the same order. <br>
As for a user being able to submit more than one answer to the same survey, we can only guess that the software used to collect reviews doesn't prevent multiple submissions.
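If these repeated reviews had to be reduced to one per order, one option (an assumption for illustration, not a rule stated in the dataset documentation or applied in this notebook) would be to keep the review with the latest review_answer_timestamp. A minimal sketch on made-up rows:

```python
import sqlite3

# Toy stand-in for the reviews table (made-up rows, not Olist data).
con = sqlite3.connect(":memory:")
con.execute(
    """
    CREATE TABLE reviews (
        review_id TEXT,
        order_id TEXT,
        review_score INTEGER,
        review_answer_timestamp TEXT
    )
    """
)
con.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?, ?)",
    [
        ("ra", "o1", 1, "2018-04-08 22:48:27"),
        ("rb", "o1", 2, "2018-04-16 11:37:31"),  # latest answer for o1
        ("rc", "o2", 5, "2018-01-01 10:00:00"),
    ],
)

# Timestamps stored as 'YYYY-MM-DD HH:MM:SS' text sort chronologically,
# so MAX() on the text column returns the latest answer per order.
# Note: an order with two reviews sharing the same latest timestamp
# would still yield two rows.
latest_reviews = con.execute(
    """
    SELECT r.order_id, r.review_id, r.review_score
    FROM reviews r
    JOIN (
        SELECT order_id, MAX(review_answer_timestamp) AS latest
        FROM reviews
        GROUP BY order_id
    ) m ON m.order_id = r.order_id
       AND m.latest = r.review_answer_timestamp
    ORDER BY r.order_id
    """
).fetchall()

print(latest_reviews)  # [('o1', 'rb', 2), ('o2', 'rc', 5)]
```

Given how rare these cases are, other choices (averaging the scores, or dropping such orders) would likely change the results very little.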

In [28]:
%%sql
SELECT Count(review_id) - Count(DISTINCT review_id)
FROM   reviews;

 * sqlite:///olist.db
Done.


Count(review_id) - Count(DISTINCT review_id)
814


In [22]:
%%sql
SELECT * FROM reviews WHERE review_id = '0501aab2f381486c36bf0f071442c0c2';

 * sqlite:///olist.db
Done.


index,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp
7629,0501aab2f381486c36bf0f071442c0c2,0068c109948b9a1dfb8530d1978acef3,1,,Espero obter uma resposta para minha encomenda. Comprei 3 controles remotos para ar condicionado e apenas recebi 1 controle Komeco e dos 2 controles Elgin pedidos e pago só foi entregue 1 controle Elg,2018-02-09 00:00:00,2018-02-10 23:55:18
66952,0501aab2f381486c36bf0f071442c0c2,d75cb3755738c4ae466303358f97bc55,1,,Espero obter uma resposta para minha encomenda. Comprei 3 controles remotos para ar condicionado e apenas recebi 1 controle Komeco e dos 2 controles Elgin pedidos e pago só foi entregue 1 controle Elg,2018-02-09 00:00:00,2018-02-10 23:55:18
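
The two rows above share the same review_id, score, comment and timestamps but point to different order_id values, so a single survey answer appears to be attached to two orders. To see how common this is, one could list every review_id linked to more than one order. The sketch below shows the idea on a made-up in-memory table; the rows and the `shared_reviews` name are hypothetical.

```python
import sqlite3

# Toy stand-in for the reviews table (made-up rows, not Olist data).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE reviews (review_id TEXT, order_id TEXT)")
con.executemany(
    "INSERT INTO reviews VALUES (?, ?)",
    [
        ("r1", "o1"),  # r1 is attached to two orders
        ("r1", "o2"),
        ("r2", "o3"),  # r2 is attached to a single order
    ],
)

# Review ids that are linked to more than one distinct order.
shared_reviews = con.execute(
    """
    SELECT review_id, COUNT(DISTINCT order_id) AS orders_count
    FROM reviews
    GROUP BY review_id
    HAVING COUNT(DISTINCT order_id) > 1
    ORDER BY orders_count DESC, review_id
    """
).fetchall()

print(shared_reviews)  # [('r1', 2)]
```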
