# 00. Study previous assessments

# 0. Hey! Remember to use the tests!!!!
When you clone the assessment repo and are in the `dsi-final-assessment` directory, make sure the unit tests work by running `make test` in your terminal.

# 1. SQL

## 1a. readychef

If you still have the readychef database, you can access it by entering `psql readychef` in your terminal.

If the above command fails, you'll have to create the database and load it with the following commands.
Here, `$` represents the terminal shell and `=#` represents the postgres shell.
```
$ psql
=# CREATE DATABASE readychef;
=# \q
$ psql readychef < path/to/readychef.sql
$ psql readychef
=# \d
           List of relations
 Schema |   Name    | Type  |  Owner   
--------+-----------+-------+----------
 public | events    | table | postgres
 public | meals     | table | postgres
 public | referrals | table | postgres
 public | users     | table | postgres
 public | visits    | table | postgres
(5 rows)


```

```
readychef=# SELECT * FROM events LIMIT 10;
     dt     | userid | meal_id | event  
------------+--------+---------+--------
 2013-01-01 |      3 |      18 | bought
 2013-01-01 |      7 |       1 | like
 2013-01-01 |     10 |      29 | bought
 2013-01-01 |     11 |      19 | share
 2013-01-01 |     15 |      33 | like
 2013-01-01 |     18 |       4 | share
 2013-01-01 |     18 |      40 | bought
 2013-01-01 |     21 |      10 | share
 2013-01-01 |     21 |       4 | like
 2013-01-01 |     22 |      23 | bought
```


```
readychef=# SELECT * FROM meals LIMIT 10;
 meal_id |  type   |     dt     | price 
---------+---------+------------+-------
       1 | french  | 2013-01-01 |    10
       2 | chinese | 2013-01-01 |    13
       3 | mexican | 2013-01-02 |     9
       4 | italian | 2013-01-03 |     9
       5 | chinese | 2013-01-03 |    12
       6 | italian | 2013-01-03 |     9
       7 | italian | 2013-01-03 |    10
       8 | french  | 2013-01-03 |    14
       9 | italian | 2013-01-03 |    13
      10 | french  | 2013-01-03 |     7
(10 rows)
```

### write queries to find the following

#### i) Total meals bought by each user (including users who bought zero meals)
#### ii) Total money spent by each user
#### iii) Total visits from each user
#### iv) Average visits per month from each user


(try writing them before looking at the answers below)

```sql
SELECT userid, SUM(CASE WHEN event='bought' THEN 1 ELSE 0 END) AS ct
FROM events
GROUP BY userid
ORDER BY ct;
```

```sql
SELECT userid, COUNT(meal_id) as ct
FROM events
WHERE event='bought'
GROUP BY userid
ORDER BY ct;
```

The query above is reused as the `boughts` subquery below.

```sql
SELECT users.userid, COALESCE(ct, 0)
FROM users
LEFT JOIN (
        SELECT userid, COUNT(meal_id) as ct
        FROM events
        WHERE event='bought'
        GROUP BY userid
           ) AS boughts
ON users.userid = boughts.userid
ORDER BY userid
LIMIT 25;
```

```sql
SELECT u.userid, COUNT(events.userid) as ct
FROM users AS u
LEFT JOIN events
ON u.userid = events.userid AND events.event = 'bought'
GROUP BY u.userid
ORDER BY ct;
```

## 1b. another table

Use the `customers` table below.

| cust_id | cust_name | current_city | hometown |
|:----------:|:------------:|:----------:|:-----------:|
| 1 | Amanda | Atlanta | Raleigh |
| 2 | Brittany | Denver | New York |
| 3 | Charles | Paris | Raleigh |
| 4 | David | San Diego | Los Angeles |
| 5 | Elizabeth | Atlanta | London |
| 6 | Greg | Denver | Atlanta |
| 7 | Maria | Raleigh | New York |
| 8 | Sarah | New York | Raleigh |
| 9 | Thomas | Atlanta | Raleigh |

#### loading this data

What's the fastest way to load some tab-delimited data into Postgres?

We'll use the `CREATE TABLE` and `COPY` commands.

https://www.postgresql.org/docs/current/static/sql-createtable.html

https://www.postgresql.org/docs/current/static/sql-copy.html

Save the following into a file called `customers.sql`

```SQL
CREATE TABLE customers (
    cust_id integer,
    cust_name character varying,
    current_city character varying,
    hometown character varying);

COPY customers (cust_id, cust_name, current_city, hometown) FROM stdin;
1	Amanda	Atlanta	Raleigh
2	Brittany	Denver	New York
3	Charles	Paris	Raleigh
4	David	San Diego	Los Angeles
5	Elizabeth	Atlanta	London
6	Greg	Denver	Atlanta
7	Maria	Raleigh	New York
8	Sarah	New York	Raleigh
9	Thomas	Atlanta	Raleigh
\.

```

`CREATE TABLE` specifies the schema. Here we are using `COPY ... FROM stdin`, which expects the data to immediately follow in the file. Note that `\.` is the end-of-data marker.

`COPY` can also be used to read a `.csv` file, and can handle other delimiters as well.

Now to load the data, we use the same procedure as before.
```
$ psql
=# CREATE DATABASE coolkids;
=# \q
$ psql coolkids < path/to/customers.sql
$ psql coolkids
```


### write queries to find the following
    
#### i) Return the city with the highest population growth. (Highest net of people who currently live there minus people who used to live there)

#### ii) Return pairs of "friends" (can be two columns or a tuple) that have both the same hometown and current city. Remove duplicates!


## 1a) readychef answers

### i: Total meals bought by each user (including users who bought zero meals)

This query counts the meals bought by each user, but it only returns users who bought at least one meal:

```sql
SELECT userid, COUNT(*) as ct 
FROM events 
WHERE event='bought' 
GROUP BY userid
```


To include the zero-meal users, we LEFT JOIN the `userid` column from `users` with the previous result, then use COALESCE to convert the NULL counts to 0:

```sql
SELECT u.userid, COALESCE(c.ct,0) AS total_meals_bought FROM users AS u
LEFT JOIN (
        SELECT userid, COUNT(*) as ct 
        FROM events 
        WHERE event='bought' 
        GROUP BY userid ) as c 
        ON u.userid = c.userid
ORDER BY u.userid
LIMIT 20;
```
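To see the LEFT JOIN plus COALESCE pattern without needing the readychef database, here is a minimal sketch using Python's built-in `sqlite3` module in place of Postgres. The three users and five events are made-up toy rows (an assumption for illustration, not the real data); user 3 has no events at all.

```python
import sqlite3

# Toy data (hypothetical, not the real readychef tables): user 3 never bought anything.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (userid INTEGER);
    CREATE TABLE events (userid INTEGER, meal_id INTEGER, event TEXT);
    INSERT INTO users VALUES (1), (2), (3);
    INSERT INTO events VALUES
        (1, 10, 'bought'), (1, 11, 'bought'), (1, 12, 'like'),
        (2, 10, 'share'), (2, 13, 'bought');
""")

# Same shape as the answer query: count purchases, then LEFT JOIN onto users
# so that users with no purchases survive with a NULL count, turned into 0.
query = """
    SELECT u.userid, COALESCE(c.ct, 0) AS total_meals_bought
    FROM users AS u
    LEFT JOIN (
        SELECT userid, COUNT(*) AS ct
        FROM events
        WHERE event = 'bought'
        GROUP BY userid
    ) AS c ON u.userid = c.userid
    ORDER BY u.userid;
"""
rows = conn.execute(query).fetchall()
print(rows)  # [(1, 2), (2, 1), (3, 0)]
```

Swapping `LEFT JOIN` for a plain `JOIN` in this sketch drops user 3 from the result, which is exactly the gap the answer query closes.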


### ii: Total money spent by each user

```sql
SELECT u.userid, COALESCE(s.spent,0) AS total_spent FROM users AS u
LEFT JOIN (
        SELECT userid, SUM(m.price) as spent 
        FROM events AS e
        JOIN meals AS m
         ON e.meal_id = m.meal_id
        WHERE event='bought' 
        GROUP BY userid ) as s
        ON u.userid = s.userid
ORDER BY u.userid
LIMIT 20;
```

### iii: Total visits from each user

```sql
SELECT u.userid, COALESCE(v.ct,0) AS total_visits FROM users AS u
LEFT JOIN (
        SELECT userid, COUNT(*) as ct 
        FROM visits 
        GROUP BY userid ) as v 
        ON u.userid = v.userid
ORDER BY u.userid
LIMIT 20;
```

### iv: Average visits per month from each user

(hint: http://www.sqlines.com/postgresql/how-to/datediff )

```sql

WITH user_life AS
(
    SELECT userid,
      ((DATE_PART('year', a.max_dt) - DATE_PART('year', a.min_dt)) * 12 +
      (DATE_PART('month', a.max_dt) - DATE_PART('month', a.min_dt)) + 1) AS user_months
     FROM
        (
        SELECT userid, MIN(dt) as min_dt, MAX(dt) AS max_dt 
        FROM visits GROUP BY userid
        ) AS a
),
total AS
(
    SELECT u.userid, COALESCE(v.ct,0) AS total_visits FROM users AS u
    LEFT JOIN (
            SELECT userid, COUNT(*) as ct 
            FROM visits 
            GROUP BY userid ) as v 
         ON u.userid = v.userid
)
SELECT uu.userid, ROUND((tt.total_visits / uu.user_months)::numeric, 2) AS avg_visits_per_month
FROM user_life AS uu
JOIN total AS tt
ON uu.userid = tt.userid
LIMIT 10;
```
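The month arithmetic inside `user_life` is easy to get off by one, so here is a small Python sketch of the same formula. The function name `months_active`, the dates, and the visit count are all made up for illustration.

```python
from datetime import date


def months_active(min_dt, max_dt):
    """Count the calendar months touched from first to last visit, inclusive.

    Mirrors the SQL: (year difference) * 12 + (month difference) + 1.
    """
    return (max_dt.year - min_dt.year) * 12 + (max_dt.month - min_dt.month) + 1


# Hypothetical user: first visit in Nov 2013, last visit in Feb 2014.
first, last = date(2013, 11, 20), date(2014, 2, 3)
user_months = months_active(first, last)  # Nov, Dec, Jan, Feb

total_visits = 10  # made-up visit count
avg_visits = round(total_visits / user_months, 2)
print(user_months, avg_visits)  # 4 2.5
```

The `+ 1` makes the span inclusive, so a user whose visits all fall in a single month gets a lifetime of one month rather than zero, which also avoids dividing by zero.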

## 1b) customers answers

### i) Return the city with the highest population growth. (Highest net of people who currently live there minus people who used to live there)

  ```SQL
  CREATE TABLE inward_migration AS
  (SELECT current_city, COUNT(*) AS net_in
  FROM customers
  GROUP BY current_city);

  CREATE TABLE outward_migration AS
  (SELECT hometown, COUNT(*) AS net_out
  FROM customers
  GROUP BY hometown);

  -- FULL OUTER JOIN keeps cities that appear in only one of the two columns
  -- (e.g. Denver is a current city but nobody's hometown)
  SELECT COALESCE(a.current_city, b.hometown) AS city,
    COALESCE(a.net_in, 0) - COALESCE(b.net_out, 0) AS net_immigration
  FROM inward_migration a
  FULL OUTER JOIN outward_migration b
  ON a.current_city = b.hometown
  ORDER BY net_immigration DESC
  LIMIT 1;
  ```
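As a cross-check on the expected answer, the net growth per city can be tallied directly in Python. This is only a sketch: the `(current_city, hometown)` pairs are copied from the `customers` table, and the variable names are made up. Note that Denver appears only as a current city, so any approach that requires a city to show up in both columns would miss it.

```python
from collections import Counter

# (current_city, hometown) pairs copied from the customers table
customers = [
    ("Atlanta", "Raleigh"), ("Denver", "New York"), ("Paris", "Raleigh"),
    ("San Diego", "Los Angeles"), ("Atlanta", "London"), ("Denver", "Atlanta"),
    ("Raleigh", "New York"), ("New York", "Raleigh"), ("Atlanta", "Raleigh"),
]

moved_in = Counter(current for current, _ in customers)
moved_out = Counter(home for _, home in customers)

# A Counter returns 0 for missing keys, so cities in only one column are kept.
cities = set(moved_in) | set(moved_out)
net = {city: moved_in[city] - moved_out[city] for city in cities}

best = max(net.values())
winners = sorted(city for city, growth in net.items() if growth == best)
print(best, winners)  # 2 ['Atlanta', 'Denver']
```

On this data Atlanta and Denver tie at a net growth of 2, so a query ending in `LIMIT 1` returns only one of them.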

### ii) Return pairs of "friends" (can be two columns or a tuple) that have both the same hometown and current city. Remove duplicates!

  ```SQL
  SELECT a.cust_id AS friend1, b.cust_id AS friend2
  FROM customers a
  JOIN customers b
  ON a.hometown = b.hometown
  AND a.current_city = b.current_city
  WHERE a.cust_id < b.cust_id
  ```
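The self-join can be tried without Postgres using Python's built-in `sqlite3` module, as in the sketch below. The rows are copied from the `customers` table; selecting names instead of ids is a small change made here for readability. The `a.cust_id < b.cust_id` filter is what removes both self-pairs and mirrored duplicates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        cust_id INTEGER, cust_name TEXT, current_city TEXT, hometown TEXT)
""")
# Rows copied from the customers table
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", [
    (1, "Amanda", "Atlanta", "Raleigh"),
    (2, "Brittany", "Denver", "New York"),
    (3, "Charles", "Paris", "Raleigh"),
    (4, "David", "San Diego", "Los Angeles"),
    (5, "Elizabeth", "Atlanta", "London"),
    (6, "Greg", "Denver", "Atlanta"),
    (7, "Maria", "Raleigh", "New York"),
    (8, "Sarah", "New York", "Raleigh"),
    (9, "Thomas", "Atlanta", "Raleigh"),
])

# Self-join on both columns; the id comparison keeps each pair exactly once.
pairs = conn.execute("""
    SELECT a.cust_name, b.cust_name
    FROM customers a
    JOIN customers b
      ON a.hometown = b.hometown
     AND a.current_city = b.current_city
    WHERE a.cust_id < b.cust_id
""").fetchall()
print(pairs)  # [('Amanda', 'Thomas')]
```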

# 3. Distributions

http://www.cs.elte.hu/~mesti/valszam/kepletek.pdf

### Beta distribution:
$\mathrm{Beta}(\alpha, \beta)$

Starting from a uniform prior, the posterior for a success probability has parameters:

$\alpha = 1 + (\#successes)$

$\beta = 1 + (\#failures)$

http://stats.stackexchange.com/a/47782
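
A short Python sketch of the update rule, assuming a uniform Beta(1, 1) prior (which is what the `1 +` terms imply). The counts of 7 successes and 3 failures and the helper name `beta_posterior` are made up for illustration; `random.betavariate` is from the standard library.

```python
import random


def beta_posterior(successes, failures):
    """Return posterior (alpha, beta) under an assumed uniform Beta(1, 1) prior."""
    return 1 + successes, 1 + failures


# Made-up experiment: 7 successes and 3 failures.
alpha, beta = beta_posterior(successes=7, failures=3)

# The mean of a Beta(alpha, beta) distribution is alpha / (alpha + beta).
posterior_mean = alpha / (alpha + beta)

# Monte Carlo check: the average of many posterior draws should be close to the mean.
random.seed(0)
draws = [random.betavariate(alpha, beta) for _ in range(10_000)]
sample_mean = sum(draws) / len(draws)

print(alpha, beta, round(posterior_mean, 3))  # 8 4 0.667
```

The posterior mean of 8/12 sits slightly below the raw success rate of 7/10 because the uniform prior acts like one extra success and one extra failure.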

# 4. Hypothesis testing

https://github.com/gschool/DSI_Lectures/blob/master/ab-testing/tammy_lee/lecture.pdf