# Practice 3 - Average review score per seller
This notebook uses three tables from the Olist database: olist_orders_dataset (renamed to orders), olist_order_items_dataset (renamed to order_items), and olist_order_reviews_dataset (renamed to order_reviews).<br>

Main question:<br>

**What is the average review score of each seller?**

## Connect and load in the database

In [1]:
%load_ext sql
%sql mysql+mysqlconnector://root:***@localhost/olist

'Connected: root@olist'

### Relational Schema <br>
<img src="files/photos/P3.png">

## SQL queries
### Confusing review data
Customers can complete a satisfaction survey that is sent to them once they receive the order or once the estimated delivery date has passed ([More from Kaggle data description](https://www.kaggle.com/olistbr/brazilian-ecommerce)). Since every created order has a deadline (order_estimated_delivery_date) regardless of its status, customers should be able to take the survey even if they never receive the order. Is that true?<br>First, let's check whether there are any null values in the review_score column.

In [2]:
%%sql
SELECT COUNT(*) AS review_score_null
FROM order_reviews
WHERE review_score = 0 OR review_score IS NULL;

 * mysql+mysqlconnector://root:***@localhost/olist
1 rows affected.


review_score_null
0
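
A general caveat for null checks like this one: `COUNT(column)` skips NULLs, so counting a column under an `IS NULL` filter can report 0 even when NULL rows exist, whereas `COUNT(*)` counts every row that passes the filter. The sketch below illustrates the difference with Python's built-in `sqlite3` module and a made-up three-row table. The rows are assumptions for illustration only, not Olist data.

```python
import sqlite3

# In-memory toy table; the rows are made up for illustration, not Olist data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_reviews (review_id TEXT, review_score INTEGER)")
conn.executemany(
    "INSERT INTO order_reviews VALUES (?, ?)",
    [("r1", 5), ("r2", None), ("r3", 0)],
)

where = "WHERE review_score = 0 OR review_score IS NULL"

# COUNT(column) ignores NULLs, so the NULL row (r2) is not counted.
count_column = conn.execute(
    f"SELECT COUNT(review_score) FROM order_reviews {where}"
).fetchone()[0]

# COUNT(*) counts every row that passes the filter, NULLs included.
count_rows = conn.execute(
    f"SELECT COUNT(*) FROM order_reviews {where}"
).fetchone()[0]

print(count_column, count_rows)
```

With these toy rows, the first count misses the NULL row while the second includes it.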


Surprisingly, no review has a score of 0 or a missing score. Next, I want to know how many of each score (1 to 5) there are in each order status group.

In [3]:
%%sql
SELECT order_status,
       SUM(IF(review_score = 1,1,0)) AS score_1,
       SUM(IF(review_score = 2,1,0)) AS score_2,
       SUM(IF(review_score = 3,1,0)) AS score_3,
       SUM(IF(review_score = 4,1,0)) AS score_4,
       SUM(IF(review_score = 5,1,0)) AS score_5
FROM (SELECT order_status, review_score
      FROM orders o JOIN order_reviews re
      ON o.order_id = re.order_id) a
GROUP BY order_status
ORDER BY order_status;

 * mysql+mysqlconnector://root:***@localhost/olist
8 rows affected.


order_status,score_1,score_2,score_3,score_4,score_5
approved,1,0,0,1,0
canceled,436,45,50,27,71
created,4,0,0,0,1
delivered,9754,3015,8056,19040,57150
invoiced,235,26,16,15,26
processing,261,19,9,6,7
shipped,693,85,121,90,129
unavailable,474,45,35,21,36
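
The pivot above is an instance of conditional aggregation: each `SUM(IF(...))` counts the rows matching one score. The same idea can be written with the standard `SUM(CASE WHEN ... END)` form. Below is a minimal sketch using Python's built-in `sqlite3` module; the three orders and their scores are made up for illustration, and only the columns needed for the pivot are created.

```python
import sqlite3

# Toy tables with made-up rows (not Olist data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id TEXT, order_status TEXT);
CREATE TABLE order_reviews (order_id TEXT, review_score INTEGER);
INSERT INTO orders VALUES ('o1','delivered'),('o2','delivered'),('o3','canceled');
INSERT INTO order_reviews VALUES ('o1',5),('o2',1),('o3',1);
""")

# Build one conditional-sum column per score value (1 to 5).
score_columns = ",\n       ".join(
    f"SUM(CASE WHEN review_score = {s} THEN 1 ELSE 0 END) AS score_{s}"
    for s in range(1, 6)
)

query = f"""
SELECT order_status,
       {score_columns}
FROM orders o JOIN order_reviews re ON o.order_id = re.order_id
GROUP BY order_status
ORDER BY order_status
"""

pivot = conn.execute(query).fetchall()
for row in pivot:
    print(row)
```

Generating the columns in a loop keeps the query short if the score range ever changes.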


In [4]:
%%sql
SELECT COUNT(DISTINCT order_id) AS num_reviewed_orders
FROM order_reviews;

 * mysql+mysqlconnector://root:***@localhost/olist
1 rows affected.


num_reviewed_orders
99441


This count implies that every order was reviewed, and it seems unrealistic that customers answered the satisfaction survey for every single order. To investigate, I need to check when these review scores were assigned. As usual, let's check for null values first.

In [5]:
%%sql
SELECT SUM(IF(order_estimated_delivery_date IS NULL
              OR order_estimated_delivery_date = '0000-00-00 00:00:00',1,0))
       AS order_estimated_delivery_date_null,
       SUM(IF(review_creation_date IS NULL
              OR review_creation_date = '0000-00-00 00:00:00',1,0))
       AS review_creation_date_null,
       SUM(IF(review_answer_timestamp IS NULL
              OR review_answer_timestamp = '0000-00-00 00:00:00',1,0))
       AS review_answer_timestamp_null
FROM orders o JOIN order_reviews re
ON o.order_id = re.order_id;

 * mysql+mysqlconnector://root:***@localhost/olist
1 rows affected.


order_estimated_delivery_date_null,review_creation_date_null,review_answer_timestamp_null
0,0,0


My question is: were these review scores submitted before or after the deadline?

In [6]:
%%sql
SELECT order_status,
       SUM(IF(TIMESTAMPDIFF(
           hour,order_estimated_delivery_date,review_answer_timestamp)
              >= 0,1,0)) AS after_deadline,
       SUM(IF(TIMESTAMPDIFF(
           hour,order_estimated_delivery_date,review_answer_timestamp)
              < 0,1,0)) AS before_deadline
FROM orders o JOIN order_reviews re
ON o.order_id = re.order_id
GROUP BY order_status
ORDER BY order_status;

 * mysql+mysqlconnector://root:***@localhost/olist
8 rows affected.


order_status,after_deadline,before_deadline
approved,2,0
canceled,526,103
created,5,0
delivered,14660,82355
invoiced,311,7
processing,299,3
shipped,1054,64
unavailable,604,7
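
A note on the comparison used here: `TIMESTAMPDIFF(hour, ...)` returns whole hours, and as far as I know MySQL truncates partial units toward zero. If so, an answer submitted less than an hour before the deadline yields 0 and lands in `after_deadline`. The sketch below emulates that truncation in plain Python with made-up timestamps; `hours_truncated` is a hypothetical helper written for this illustration, not a MySQL or library function.

```python
from datetime import datetime


def hours_truncated(start: datetime, end: datetime) -> int:
    """Whole hours from start to end, truncated toward zero.

    Emulates the assumed behaviour of TIMESTAMPDIFF(hour, start, end).
    """
    return int((end - start).total_seconds() / 3600)


# Made-up timestamps: the review is answered 30 minutes BEFORE the deadline.
deadline = datetime(2018, 1, 10, 12, 0, 0)
answered = datetime(2018, 1, 10, 11, 30, 0)

# Hour-based bucketing: -0.5 hours truncates to 0, so ">= 0" says "after".
bucket_by_hours = "after" if hours_truncated(deadline, answered) >= 0 else "before"

# Direct timestamp comparison classifies the same review as "before".
bucket_direct = "after" if answered >= deadline else "before"

print(bucket_by_hours, bucket_direct)
```

In SQL, the direct form would be a plain comparison such as `review_answer_timestamp >= order_estimated_delivery_date`. The effect on the counts above is probably small, since it only concerns reviews answered within an hour of the deadline.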


Now we see that customers can somehow review an order before the deadline has passed. Another test: were these review scores submitted before or after the survey was created?

In [7]:
%%sql
SELECT order_status,
       SUM(IF(TIMESTAMPDIFF(
           hour,review_creation_date,review_answer_timestamp)
              >= 0,1,0)) AS after_survey,
       SUM(IF(TIMESTAMPDIFF(
           hour,review_creation_date,review_answer_timestamp)
              < 0,1,0)) AS before_survey
FROM orders o JOIN order_reviews re
ON o.order_id = re.order_id
GROUP BY order_status
ORDER BY order_status;

 * mysql+mysqlconnector://root:***@localhost/olist
8 rows affected.


order_status,after_survey,before_survey
approved,2,0
canceled,629,0
created,5,0
delivered,97015,0
invoiced,318,0
processing,302,0
shipped,1118,0
unavailable,611,0


Alright, now we're sure that these review scores were collected from satisfaction surveys, and that the surveys are not sent out only when the order is delivered or after the deadline has passed. Last test: do any of these reviews have a review_answer_timestamp within the same hour as their review_creation_date?

In [8]:
%%sql
SELECT order_status,
       SUM(IF(TIMESTAMPDIFF(
           hour,review_creation_date,review_answer_timestamp)
              = 0,1,0)) AS surprise
FROM orders o JOIN order_reviews re
ON o.order_id = re.order_id
GROUP BY order_status
ORDER BY order_status;

 * mysql+mysqlconnector://root:***@localhost/olist
8 rows affected.


order_status,surprise
approved,0
canceled,0
created,0
delivered,0
invoiced,0
processing,0
shipped,0
unavailable,0


No review was answered within an hour of the survey being created, so customers really did complete these surveys; they weren't filled in by some AI as I had imagined. I still need more information to understand how customers review their orders, though.<br>
Let's move forward with what we have now.

### Average review score
Olist requires merchants to maintain an average review score of at least 4 (on a scale of 0 to 5) to guarantee their performance. [(More)](https://get.olist.help/pt-BR/articles/413344-acordo-de-nivel-de-servico-entre-parceiro-olist-e-lojista)<br>
The query below returns the total number of sellers and the number of sellers who meet this requirement. It is based on all review scores of all orders.

In [9]:
%%sql
SELECT b.num_sellers,
       COUNT(a.seller_id) AS num_qualified_sellers,
       ROUND(COUNT(a.seller_id)/b.num_sellers*100,2)
       AS percentage
FROM (SELECT seller_id, AVG(review_score) AS avg_score
      FROM order_reviews re JOIN order_items oi
      ON re.order_id = oi.order_id
      GROUP BY seller_id
      HAVING avg_score >= 4) a
CROSS JOIN 
     (SELECT COUNT(DISTINCT seller_id) AS num_sellers
      FROM order_items) b;

 * mysql+mysqlconnector://root:***@localhost/olist
1 rows affected.


num_sellers,num_qualified_sellers,percentage
3095,1948,62.94
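
One thing to watch in this join: `order_items` has one row per item, so an order with several items from the same seller contributes its review score several times to that seller's average. Whether that weighting is intended is a judgment call. The sketch below uses Python's built-in `sqlite3` module and made-up rows (the seller, order, and review IDs are hypothetical) to contrast the item-weighted average with one that counts each seller and order pair once.

```python
import sqlite3

# Toy tables with made-up rows (not Olist data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_items (order_id TEXT, order_item_id INTEGER, seller_id TEXT);
CREATE TABLE order_reviews (review_id TEXT, order_id TEXT, review_score INTEGER);
-- Seller s1: order o1 has three items, order o2 has one item.
INSERT INTO order_items VALUES
    ('o1',1,'s1'),('o1',2,'s1'),('o1',3,'s1'),('o2',1,'s1');
INSERT INTO order_reviews VALUES ('r1','o1',1),('r2','o2',5);
""")

# Same join shape as the notebook query: the score of o1 is counted three times.
item_weighted = conn.execute("""
SELECT seller_id, AVG(review_score)
FROM order_reviews re JOIN order_items oi ON re.order_id = oi.order_id
GROUP BY seller_id
""").fetchall()

# Collapse order_items to one row per (seller, order) before joining.
per_order = conn.execute("""
SELECT so.seller_id, AVG(re.review_score)
FROM (SELECT DISTINCT seller_id, order_id FROM order_items) so
JOIN order_reviews re ON re.order_id = so.order_id
GROUP BY so.seller_id
""").fetchall()

print(item_weighted, per_order)
```

In this toy case the one-star order drags the item-weighted average down to 2.0, while the per-order average is 3.0. On the real data, the share of qualified sellers could shift in either direction.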


62.94% of sellers have an average review score of 4 or higher. Below are the top 10 of them. Since all ten tie at 5.0 and the query has no tiebreaker, their order is arbitrary.

In [10]:
%%sql
SELECT seller_id, ROUND(AVG(review_score),2) AS avg_score
FROM order_reviews re JOIN order_items oi
ON re.order_id = oi.order_id
GROUP BY seller_id
HAVING avg_score >= 4
ORDER BY avg_score DESC
LIMIT 10;

 * mysql+mysqlconnector://root:***@localhost/olist
10 rows affected.


seller_id,avg_score
062c325cd6a2b87845fab56b4ec2eeae,5.0
296729ffb9b684050dd24836dac4494a,5.0
1cd9e0cc1839d55516843def5600816d,5.0
48b6c3f4c6a93171da04b75313f2130f,5.0
48efc9d94a9834137efd9ea76b065a38,5.0
fa5fdc4e4bb6bd1009ad0e4ac4096562,5.0
166e8f1381e09651983c38b1f6f91c11,5.0
44717f64ec2a457979cf83c429077666,5.0
b5bb2b985208834bd5bd86c7a402bbad,5.0
8ec76bb0965af3f007692b26fa9d6623,5.0
