# Practice 1 - Customer growth rate and repeating rate
This notebook uses two tables from the Olist database: olist_customers_dataset (renamed to customers) and olist_orders_dataset (renamed to orders).<br>
It addresses three main questions:

1. **What is the customer growth rate?**
2. **What is the customer repeating rate?**
3. **What is the median number of orders placed by each customer?**

## Connect and load in the database

In [1]:
%load_ext sql
%sql mysql+mysqlconnector://root:***@localhost/olist

'Connected: root@olist'

### Relational Schema
<img src="files/photos/P1.png">

## SQL queries

### Quick check
I would like to know whether a single customer_unique_id can be associated with more than one customer_state value.

In [2]:
%%sql
SELECT customer_unique_id,
       COUNT(DISTINCT customer_state) AS num_states
FROM customers
GROUP BY customer_unique_id
HAVING num_states > 1
ORDER BY num_states DESC
LIMIT 10;

 * mysql+mysqlconnector://root:***@localhost/olist
10 rows affected.


customer_unique_id,num_states
d44ccec15f5f86d14d6a2cfa67da1975,3
2410195f6521688005612363835a2671,2
2c45ab66a3dae52960147e76a35740ff,2
2c6a91479a7dc00d8c9d650d8dee88ca,2
408aee96c75632a92e5079eee61da399,2
5192c897072033288df55bd01b0e5737,2
5275b2f97b9c995d3d05a58610c0bb67,2
547d0504ca415eb4864fa3030f73d3f3,2
5cbfdb85ec130898108b32c50d619c39,2
62a25a159f9fd2ab7c882d9407f49aa9,2


Some customers have addresses in more than one state, which means we cannot use customer_state values to analyze the number of customers by state. From here on, all queries in this notebook are broken down by year only. Let's check the range of order purchase times.

In [3]:
%%sql
SELECT MIN(order_purchase_timestamp) AS first_order,
       MAX(order_purchase_timestamp) AS last_order
FROM orders;

 * mysql+mysqlconnector://root:***@localhost/olist
1 rows affected.


first_order,last_order
2016-09-04 21:15:19,2018-10-17 17:30:18


The first order was placed at the beginning of September 2016, and the last order was placed in the middle of October 2018.

In [4]:
%%sql
SELECT YEAR(order_purchase_timestamp) AS year,
       MONTH(order_purchase_timestamp) AS month
FROM orders
GROUP BY year, month
ORDER BY year, month;

 * mysql+mysqlconnector://root:***@localhost/olist
25 rows affected.


year,month
2016,9
2016,10
2016,12
2017,1
2017,2
2017,3
2017,4
2017,5
2017,6
2017,7


One month is missing: November 2016. This is explained [here](https://www.kaggle.com/olistbr/brazilian-ecommerce/discussion/69728).
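
A gap like this can also be detected programmatically instead of by eye. The snippet below is a minimal sketch in Python, not part of the original notebook; the helper name `missing_months` is hypothetical, and the input is just the first few (year, month) rows shown in the output above.

```python
def missing_months(present):
    """Return the (year, month) pairs absent between the earliest and latest pair."""
    seen = set(present)
    (y, m), last = min(seen), max(seen)
    gaps = []
    while (y, m) < last:
        if (y, m) not in seen:
            gaps.append((y, m))
        # Advance one calendar month, rolling over after December.
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return gaps

# First rows of the query output above.
observed = [(2016, 9), (2016, 10), (2016, 12), (2017, 1), (2017, 2)]
print(missing_months(observed))  # [(2016, 11)]
```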

### Number of customers and customers growth rate
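Assuming that Olist started in 2016, I would like to know how many new customers were acquired each year. The first query counts the distinct customers who placed an order in each year. The second query assigns each customer to the year of their first order and computes the growth rate as the number of new customers in a year divided by the cumulative number of customers acquired up to the end of the previous year.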

In [5]:
%%sql
SELECT YEAR(order_purchase_timestamp) AS year,
       COUNT(DISTINCT customer_unique_id) AS num_customers             
FROM customers c JOIN orders o
ON c.customer_id = o.customer_id
GROUP BY year;

 * mysql+mysqlconnector://root:***@localhost/olist
3 rows affected.


year,num_customers
2016,326
2017,43713
2018,52749


In [6]:
%%sql
SELECT c.year, c.new_customers,
       CASE c.year
       WHEN '2016' THEN 0
       WHEN '2017' THEN ROUND(c.new_customers/total_2016*100,2)
       WHEN '2018' THEN ROUND(c.new_customers/total_2017*100,2)
       END AS growth_rate
FROM
(SELECT b.*,
        SUM(IF(a.year = '2016',1,0)) AS total_2016,
        SUM(IF(a.year IN ('2016','2017'),1,0)) AS total_2017,
        COUNT(a.customer_unique_id) AS total_2018
FROM (SELECT customer_unique_id,
             YEAR(MIN(order_purchase_timestamp)) AS year
      FROM customers c JOIN orders o
      ON c.customer_id = o.customer_id
      GROUP BY customer_unique_id) a
CROSS JOIN (SELECT year, 
                   COUNT(customer_unique_id) AS new_customers
            FROM (SELECT customer_unique_id,
                         YEAR(MIN(order_purchase_timestamp)) AS year
                  FROM customers c JOIN orders o
                  ON c.customer_id = o.customer_id
                  GROUP BY customer_unique_id) a
            GROUP BY year) b
GROUP BY b.year) c
ORDER BY year;

 * mysql+mysqlconnector://root:***@localhost/olist
3 rows affected.


year,new_customers,growth_rate
2016,326,0.0
2017,43708,13407.36
2018,52062,118.23
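
As a sanity check, the growth rates can be recomputed from the new-customer counts alone. This is a small Python sketch, assuming the same definition the query appears to use (new customers divided by the cumulative customer count at the end of the previous year, with the first year set to 0); the function name `growth_rates` is hypothetical.

```python
from itertools import accumulate

# New customers per year, taken from the query output above.
new_customers = {2016: 326, 2017: 43708, 2018: 52062}

def growth_rates(new_by_year):
    years = sorted(new_by_year)
    # Running total of customers acquired up to and including each year.
    cumulative = list(accumulate(new_by_year[y] for y in years))
    rates = {years[0]: 0.0}  # no earlier base for the first year
    for i in range(1, len(years)):
        rates[years[i]] = round(new_by_year[years[i]] / cumulative[i - 1] * 100, 2)
    return rates

print(growth_rates(new_customers))
# {2016: 0.0, 2017: 13407.36, 2018: 118.23}
```

The 2018 figure divides by 326 + 43708 = 44034, the total number of customers acquired through 2017, which matches the `total_2017` column in the query.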


### Customers repeating rate
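Next, let's check the customer repeating rate: the share of each year's customers who placed an order in that year that was not their first-ever order.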

In [7]:
%%sql
SELECT a.year, c.repeat_customers,
       ROUND(c.repeat_customers/a.num_customers*100,2) AS repeat_rate
FROM (SELECT YEAR(order_purchase_timestamp) AS year,
             COUNT(DISTINCT customer_unique_id) AS num_customers             
      FROM customers c JOIN orders o
      ON c.customer_id = o.customer_id
      GROUP BY year) a
JOIN (SELECT year,
             COUNT(DISTINCT customer_unique_id) AS repeat_customers
      FROM (SELECT CASE ROW_NUMBER() 
                        OVER(PARTITION BY customer_unique_id 
                             ORDER BY order_purchase_timestamp)
                   WHEN 1 THEN 'Total'
                   ELSE 'Return'
                   END AS customer_type, c.customer_unique_id,
                   YEAR(order_purchase_timestamp) AS year
             FROM customers c JOIN orders o
             ON c.customer_id = o.customer_id
             ORDER BY customer_unique_id, order_purchase_timestamp) b
             WHERE customer_type = 'Return'
             GROUP BY year, customer_type) c
ON a.year = c.year;   

 * mysql+mysqlconnector://root:***@localhost/olist
3 rows affected.


year,repeat_customers,repeat_rate
2016,3,0.92
2017,1261,2.88
2018,1799,3.41
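
The logic of the query above (flag every order after a customer's first as a return, then divide the distinct returning customers by the distinct customers active in that year) can be sketched in plain Python. The function name `repeat_rates` and the toy order list are hypothetical and only illustrate the calculation. Unlike the inner join in the SQL, this sketch also reports years with no repeat customers, with a rate of 0.

```python
from collections import defaultdict

def repeat_rates(orders):
    """orders: (customer_unique_id, year) tuples, one per order, in chronological order."""
    active = defaultdict(set)  # year -> customers with any order that year
    repeat = defaultdict(set)  # year -> customers placing a non-first order that year
    seen = set()               # customers who have already ordered at least once
    for customer, year in orders:
        active[year].add(customer)
        if customer in seen:
            repeat[year].add(customer)
        seen.add(customer)
    return {y: round(len(repeat[y]) / len(active[y]) * 100, 2) for y in sorted(active)}

# Toy data (hypothetical): customer "a" reorders in 2016 and 2017, "b" returns in 2018.
toy_orders = [("a", 2016), ("b", 2016), ("a", 2016),
              ("a", 2017), ("c", 2017), ("d", 2017),
              ("b", 2018)]
print(repeat_rates(toy_orders))
# {2016: 50.0, 2017: 33.33, 2018: 100.0}
```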


### Number of orders and median number of orders
Lastly, I would like to know the total number of orders placed each year and the median number of orders placed by unique customers each year.

In [8]:
%%sql
SELECT YEAR(order_purchase_timestamp) AS year,
       COUNT(DISTINCT order_id) AS num_orders             
FROM customers c JOIN orders o
ON c.customer_id = o.customer_id
GROUP BY year;

 * mysql+mysqlconnector://root:***@localhost/olist
3 rows affected.


year,num_orders
2016,329
2017,45101
2018,54011


MySQL does not have a built-in median function, but there are several workarounds. Below is a solution adapted from [Daniel Setzermann's blog](https://technicalmarketing.guide/blog/how-to-calculate-the-median-per-group-with-mysql/).

In [9]:
%%sql
SELECT year,
       ROUND(AVG(num_orders),0) AS median_orders
FROM (SELECT ROW_NUMBER() 
             OVER(PARTITION BY a.year
                  ORDER BY a.num_orders) AS count_of_group,
             a.year, a.num_orders, b.total_of_group
      FROM (SELECT YEAR(order_purchase_timestamp) AS year,
                   COUNT(DISTINCT c.customer_id) AS num_orders
            FROM customers c JOIN orders o
            ON c.customer_id = o.customer_id
            GROUP BY year, customer_unique_id
            ORDER BY year, num_orders) a
      JOIN (SELECT year,
                   COUNT(num_orders) AS total_of_group
            FROM (SELECT YEAR(order_purchase_timestamp) AS year,
                         COUNT(DISTINCT c.customer_id) AS num_orders
                  FROM customers c JOIN orders o
                  ON c.customer_id = o.customer_id
                  GROUP BY year, customer_unique_id) a
            GROUP BY year) b
      ON a.year = b.year) c
WHERE count_of_group BETWEEN total_of_group/2 AND total_of_group/2 + 1
GROUP BY year
ORDER BY year;

 * mysql+mysqlconnector://root:***@localhost/olist
3 rows affected.


year,median_orders
2016,1
2017,1
2018,1
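
The row-number technique used in the query can be illustrated outside SQL. The Python sketch below uses a hypothetical function `median_per_group` and made-up toy rows. Within each group it sorts the values, keeps the rows whose 1-based position lies between n/2 and n/2 + 1, and averages them. That selects the single middle row for odd n and the two middle rows for even n.

```python
from collections import defaultdict

def median_per_group(rows):
    """rows: iterable of (group, value) pairs. Returns {group: median}."""
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)
    medians = {}
    for group, values in groups.items():
        values.sort()
        n = len(values)
        # Same filter as the SQL: row_number BETWEEN n/2 AND n/2 + 1.
        middle = [v for r, v in enumerate(values, start=1) if n / 2 <= r <= n / 2 + 1]
        medians[group] = sum(middle) / len(middle)
    return medians

# Toy data (hypothetical): orders per customer, grouped by year.
toy_rows = [(2016, 1), (2016, 1), (2016, 2),
            (2017, 1), (2017, 1), (2017, 2), (2017, 3)]
print(median_per_group(toy_rows))
# {2016: 1.0, 2017: 1.5}
```

The notebook's query additionally rounds the average to zero decimals, so a value such as 1.5 would be reported as a whole number there.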


The median is 1 in every year, so a typical customer placed only one order per year via Olist.