### Q1&Q2 Upload data and load into MySQL

Used PSFTP.

Used the recommended commands.

Used the DATE type for date columns, and other types as appropriate.

### Q3&Q4 Observe and key all tables

First use the id column as the primary key for the users, subscriptions and purchases tables:

```
ALTER TABLE subscriptions ADD PRIMARY KEY (id);

ALTER TABLE users ADD PRIMARY KEY (id);

ALTER TABLE purchases ADD PRIMARY KEY (id);
```

And make necessary columns NOT NULL e.g.:

```
ALTER TABLE assignments MODIFY sub_id INT NOT NULL;
```

FOREIGN KEY on subscriptions.id:

```
ALTER TABLE `assignments` 
ADD FOREIGN KEY(`sub_id`) 
REFERENCES subscriptions(id);
```
FOREIGN KEY on users.id:

```
ALTER TABLE `purchases` 
ADD FOREIGN KEY(`user_id`) 
REFERENCES users(id);	

ALTER TABLE `subscriptions` 
ADD FOREIGN KEY(`user_id`) 
REFERENCES users(id);	
```
Joins run much faster now that the join columns are indexed.

### Q5 What's wrong with data set?

There are ~3x as many subscriptions starting in November and December as in other months.

There are 2 duplicate IP addresses.

Most of the names are duplicate/triplicate.

January's purchase total is roughly 1.5 to 2.5 times that of any other full month (see the monthly totals in Q7).

All the purchases are roughly the same value, which wouldn't make sense for a Costco-like store.

### Q6 Evaluate AB Test

We can get a pre-test baseline of purchases per subscription per month the following way:

```
select sum(amount) from purchases where date<'2016-07-01';

+-------------+
| sum(amount) |
+-------------+
|   916984.65 |
+-------------+

select count(*) from assignments;

+----------+
| count(*) |
+----------+
|   154288 |
+----------+

```
So the purchase amount per subscription per month is: $\dfrac{\textrm{Total purchase value}}{\textrm{Months} \times \textrm{Subscriptions}} = \dfrac{916984.65}{6 \times 154288} \approx 0.99$

Next, join purchases to the test-group assignments to get the total purchase amount per user per month for each test_group:

```
select purchases.user_id,amount,date,test_group 
from purchases 
join subscriptions 
on purchases.user_id = subscriptions.user_id 
join assignments 
on subscriptions.id = sub_id
where date>='2016-07-01';

...
|  152550 |   4.78 | 2016-09-01 | control    |
|  152550 |   4.71 | 2016-09-01 | control    |
|  152562 |   5.03 | 2016-09-01 | test       |
|  152582 |   4.75 | 2016-09-01 | control    |
|  152594 |   5.02 | 2016-09-01 | control    |
|  152598 |   4.96 | 2016-09-01 | test       |
|  152606 |   4.97 | 2016-09-01 | control    |
|  121530 |   4.77 | 2016-09-01 | test       |
+---------+--------+------------+------------+

```
Then get the sum of purchases for each group:

```

select test_group, sum(amount) as tot_amt from
(select purchases.user_id,amount,date,test_group 
from purchases 
join subscriptions 
on purchases.user_id = subscriptions.user_id 
join assignments 
on subscriptions.id = sub_id
where date>='2016-07-01') as purchinf
group by test_group
;

+------------+-----------+
| test_group | tot_amt   |
+------------+-----------+
| control    | 166288.69 |
| test       | 172182.07 |
+------------+-----------+
```

But now we need the total number of users in each cohort:

```
select test_group, count(*) from assignments group by test_group;

+------------+----------+
| test_group | count(*) |
+------------+----------+
| control    |    76919 |
| test       |    77369 |
+------------+----------+
```

So the purchase amount per user per month is:

control: $\dfrac{166288.69}{76919 \times 2} \approx 1.08$

test: $\dfrac{172182.07}{77369 \times 2} \approx 1.11$
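
To make the comparison explicit, the two rates and the relative lift can be recomputed from the query outputs above. This is only a sketch: the variable names are made up for illustration, and the 2-month window follows the formulas above (July and August).

```python
# Totals and cohort sizes copied from the query outputs above.
MONTHS = 2  # July and August 2016, as in the formulas above

totals = {"control": 166288.69, "test": 172182.07}
cohort_sizes = {"control": 76919, "test": 77369}  # rows in assignments per group

# Purchase amount per user per month for each group.
per_user_month = {
    group: totals[group] / (cohort_sizes[group] * MONTHS)
    for group in totals
}

# Relative lift of test over control.
lift = per_user_month["test"] / per_user_month["control"] - 1

for group, value in per_user_month.items():
    print(f"{group}: {value:.2f}")
print(f"relative lift: {lift:.1%}")
```

The lift comes out at roughly 3%, which on its own says nothing about whether the difference is statistically meaningful.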

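It appears the test group does slightly better when normalized by all users. This could be confounded by the proportion of active subscriptions in each test group during the date range. The queries below explore that by restricting to subscriptions that were still active in the test window:
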
```
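-- First attempt: mean and standard deviation of post-launch spend per subscription.
-- The WHERE on date drops subscriptions without purchases, so the LEFT JOIN acts like an inner join.
select avg(tot_amt_per_user) as avg, stddev(tot_amt_per_user) as std, test_group from
(select id as user_id, sum(amount) as tot_amt_per_user, test_group from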
(select subscriptions.id,amount,date,test_group
from subscriptions
left join purchases 
on purchases.user_id = subscriptions.user_id 
join assignments
on subscriptions.id = assignments.sub_id
where date>'2016-07-01') as purchinf
group by user_id) as user_purch
group by test_group;


-- Number of assigned subscriptions still active after the test start, per group.
select test_group, count(*) from 
(
select s.id, test_group from subscriptions s 
join assignments 
on s.id=sub_id 
where end_date > '2016-07-01' or end_date IS NULL
) as tab
group by test_group;


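-- Users holding more than one such subscription; their purchases are duplicated by the user_id joins.
select user_id, count(*) from (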
select s.id,user_id, test_group 
from subscriptions s 
join assignments 
on 
s.id=sub_id 
where (end_date > '2016-07-01') or (end_date IS NULL)
) as tab group by user_id having count(*) > 1;



-- Mean and standard deviation of post-launch spend per user, counting users with no purchases as 0.
-- The purchase date filter sits in the ON clause so the LEFT JOIN keeps users without purchases.
select test_group, avg(tot_per_user) as avg, stddev(tot_per_user) as std from
(
    select user_id,sum(amount) as tot_per_user,test_group from
    (
        select s.id,s.user_id,IFNULL(amount, 0) as amount,test_group,purchases.date
        from subscriptions s 
        join assignments 
        on s.id=sub_id 
        left join purchases 
        on purchases.user_id = s.user_id 
        and purchases.date >= '2016-07-01'
        where (end_date > '2016-07-01') or (end_date IS NULL)
    ) as sums
    group by user_id, test_group
) as final
group by test_group;
```
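
Once the last query returns a mean and standard deviation per group, a two-sample comparison can indicate whether the difference is larger than noise. The sketch below is a Welch-style test with a normal approximation, which seems reasonable for cohorts of tens of thousands. The function name `welch_z` and every input number are hypothetical placeholders, not results from this data set; the real query output and per-group user counts would go in their place. Note that MySQL's `STDDEV()` returns the population standard deviation, which differs negligibly from the sample version at this size.

```python
import math


def welch_z(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Return (z, two-sided p-value) for the difference in means b - a.

    Uses unequal-variance standard errors and a normal approximation,
    so it is only appropriate for large groups.
    """
    standard_error = math.sqrt(std_a ** 2 / n_a + std_b ** 2 / n_b)
    z = (mean_b - mean_a) / standard_error
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Placeholder inputs (NOT from the data): control first, then test.
z, p = welch_z(mean_a=1.0, std_a=3.0, n_a=50000,
               mean_b=1.1, std_b=3.0, n_b=50000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value would suggest the lift is unlikely to be chance alone, though purchase amounts per user are often skewed, so a bootstrap or a rank-based test may be a safer follow-up.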



### Q7 Company evaluation

We can look at how sales are changing month to month:

```
select sum(amount) from purchases group by month(date);

+-------------+
| sum(amount) |
+-------------+
|   254272.05 |
|   164846.26 |
|   110667.99 |
|   104222.72 |
|   141337.98 |
|   141637.65 |
|   160638.66 |
|   166556.71 |
|     5380.47 |
+-------------+

```

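There's a steep drop-off after January (post-holiday sales?), but otherwise monthly sales look fairly stable. The last row is far smaller, presumably because the data ends on 2016-09-01 and covers only one day of September.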

Also, we can count number of subscriptions beginning (and that have ended) in each year:

```
select year(start_date), count(*) as total, count(end_date) as churned from subscriptions group by year(start_date); 


+------------------+-------+---------+
| year(start_date) | total | churned |
+------------------+-------+---------+
|             2012 |  3870 |    3828 |
|             2013 | 10815 |   10564 |
|             2014 | 31351 |   30121 |
|             2015 | 62568 |   56017 |
|             2016 | 45684 |   22751 |
+------------------+-------+---------+

```

So there is pretty healthy growth in new subscriptions overall, but the churn seems high: only ~10% of subscriptions that started in 2015 are still current, and about 50% of subscriptions started this year have already churned.
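
The percentages quoted above can be reproduced from the table. The sketch below only redoes that arithmetic; the dictionary and variable names are illustrative.

```python
# (total, churned) per start year, copied from the query output above.
cohorts = {
    2012: (3870, 3828),
    2013: (10815, 10564),
    2014: (31351, 30121),
    2015: (62568, 56017),
    2016: (45684, 22751),
}

# Share of each start-year cohort that has an end_date (churned) or not (active).
churn_share = {year: churned / total for year, (total, churned) in cohorts.items()}
active_share = {year: 1 - share for year, share in churn_share.items()}

for year in cohorts:
    print(f"{year}: churned {churn_share[year]:.1%}, still active {active_share[year]:.1%}")
```

Older cohorts have had more time to churn, so these shares are not directly comparable across years; a fixed-horizon measure (for example, churn within 6 months of start_date) would likely be a fairer comparison.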

### Q8 Active user table

First create a date table with all dates from 2012-01-09 to 2016-09-01:

```
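-- Assumed schema: the original definition of _date is not shown.
CREATE TABLE IF NOT EXISTS _date (datelist DATE NOT NULL);
DROP PROCEDURE IF EXISTS filldates;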
DELIMITER |
CREATE PROCEDURE filldates(dateStart DATE, dateEnd DATE)
BEGIN
  WHILE dateStart <= dateEnd DO
    INSERT INTO _date (datelist) VALUES (dateStart);
    SET dateStart = date_add(dateStart, INTERVAL 1 DAY);
  END WHILE;
END;
|
DELIMITER ;
CALL filldates('2012-01-09','2016-09-01');

SELECT * FROM _date;

...
| 2016-08-27 |
| 2016-08-28 |
| 2016-08-29 |
| 2016-08-30 |
| 2016-08-31 |
| 2016-09-01 |
+------------+
```

Then try to build the active user table:

```
SELECT 
user_id, 
datelist as date, 
    (case 
    when end_date IS NULL
    then TRUE
    when (end_date IS NOT NULL) and (datelist<end_date)
    then TRUE
    else FALSE
    end) as is_active ,
start_date as signup_date,
signup_platform  
from subscriptions 
join _date 
on start_date<datelist 
join users 
on users.id = user_id
WHERE user_id BETWEEN 100000 AND 105000;

|  104997 | 2016-09-01 |         0 | 2015-12-25  | ios             |
|  104998 | 2016-09-01 |         0 | 2015-12-25  | ios             |
|  104999 | 2016-09-01 |         0 | 2015-12-25  | ios             |
|  105000 | 2016-09-01 |         0 | 2015-12-25  | ios             |
+---------+------------+-----------+-------------+-----------------+
1288429 rows in set (1.97 sec)

(FALSE evaluates to 0)

```

...and MySQL crashes if you don't restrict user_id.

To make it more efficient, you could append only today's rows each day, since rows for past dates would not change.
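
A rough size estimate suggests why the unrestricted join is so heavy. The sketch below computes an upper bound only: it pairs every subscription with every date in `_date`, whereas the actual join keeps just the dates after each start_date. The subscription count is the sum of the yearly totals in Q7; variable names are illustrative.

```python
from datetime import date

# Date range passed to filldates above (inclusive on both ends).
first_day = date(2012, 1, 9)
last_day = date(2016, 9, 1)
n_dates = (last_day - first_day).days + 1

# Total subscriptions: sum of the yearly totals from the Q7 table.
n_subscriptions = 3870 + 10815 + 31351 + 62568 + 45684

# Upper bound on output rows: every subscription paired with every date.
upper_bound_rows = n_subscriptions * n_dates

print(f"{n_dates} dates x {n_subscriptions} subscriptions "
      f"= at most {upper_bound_rows:,} rows")
```

Even if the true row count is a fraction of this bound, it is still far larger than the 1.3 million rows returned for the restricted user_id range.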

Let's count churn by date.

```
SELECT end_date, count(end_date) as num_churn from subscriptions group by end_date;

+------------+-----------+
| end_date   | num_churn |
+------------+-----------+
...
...
...
| 2016-08-24 |       182 |
| 2016-08-25 |       178 |
| 2016-08-26 |       185 |
| 2016-08-27 |       187 |
| 2016-08-28 |       196 |
| 2016-08-29 |       182 |
| 2016-08-30 |       189 |
| 2016-08-31 |       185 |
| 2016-09-01 |       176 |
+------------+-----------+
```

### Q9 Annual vs monthly

We can look at the purchase amount this year, grouped by subscription type:

```
select sum(amount), sub_type 
from purchases 
join subscriptions 
on subscriptions.user_id = purchases.user_id 
group by sub_type;

+-------------+----------+
| sum(amount) | sub_type |
+-------------+----------+
|    91742.75 | annual   |
|  1178183.30 | monthly  |
+-------------+----------+


```

So the change should definitely be made. 

Expected revenue increase would be ~ 0.10 * (1,200,000 - 100,000) = 110,000.

Annualized over the 8 months of data, this would be about 110,000 / 8 × 12 = 165,000.
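
The estimate above uses rounded totals. As a quick check, the same formula can be evaluated with the exact sums from the query output. The 10% factor and the 8-month span are taken from the calculation above; the variable names are illustrative.

```python
# Sums by sub_type copied from the query output above.
annual_total = 91742.75
monthly_total = 1178183.30

RATE = 0.10         # factor used in the estimate above
MONTHS_OF_DATA = 8  # the 8 used in the annualization above

# Same formula as above, without rounding the inputs.
increase = RATE * (monthly_total - annual_total)
annualized = increase / MONTHS_OF_DATA * 12

print(f"increase over {MONTHS_OF_DATA} months: {increase:,.0f}")
print(f"annualized: {annualized:,.0f}")
```

With the exact totals both figures land slightly below the rounded 110,000 and 165,000, which does not change the conclusion.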