# 00. Study previous assessments

# 0. Hey! Remember to use the tests!!!!
After you clone the assessment repo, `cd` into the `final-assessment-files` directory and confirm that the unit tests run by entering `make test` in your terminal.

# 1. SQL: readychef

If you still have the readychef database, you can access it by entering `psql readychef` in your terminal.

If the above command fails, you'll have to create the database and load it with the following commands.
Here, `$` represents the terminal shell and `=#` represents the postgres shell.
```
$ psql
=# CREATE DATABASE readychef;
=# \q
$ psql -f path/to/readychef.sql readychef
$ psql readychef
=# \d
           List of relations
 Schema |   Name    | Type  |  Owner   
--------+-----------+-------+----------
 public | events    | table | postgres
 public | meals     | table | postgres
 public | referrals | table | postgres
 public | users     | table | postgres
 public | visits    | table | postgres
(5 rows)


```

```
readychef=# SELECT * FROM events LIMIT 10;
     dt     | userid | meal_id | event  
------------+--------+---------+--------
 2013-01-01 |      3 |      18 | bought
 2013-01-01 |      7 |       1 | like
 2013-01-01 |     10 |      29 | bought
 2013-01-01 |     11 |      19 | share
 2013-01-01 |     15 |      33 | like
 2013-01-01 |     18 |       4 | share
 2013-01-01 |     18 |      40 | bought
 2013-01-01 |     21 |      10 | share
 2013-01-01 |     21 |       4 | like
 2013-01-01 |     22 |      23 | bought
```


```
readychef=# SELECT * FROM meals LIMIT 10;
 meal_id |  type   |     dt     | price 
---------+---------+------------+-------
       1 | french  | 2013-01-01 |    10
       2 | chinese | 2013-01-01 |    13
       3 | mexican | 2013-01-02 |     9
       4 | italian | 2013-01-03 |     9
       5 | chinese | 2013-01-03 |    12
       6 | italian | 2013-01-03 |     9
       7 | italian | 2013-01-03 |    10
       8 | french  | 2013-01-03 |    14
       9 | italian | 2013-01-03 |    13
      10 | french  | 2013-01-03 |     7
(10 rows)
```

# Exercise: write queries to find the following
    

### i: Total meals bought by each user (including users who bought zero meals)
### ii: Total money spent by each user
### iii: Total visits from each user
### iv: Average visits per month from each user


(Try writing them yourself before looking at the answers below.)

### i: Total meals bought by each user (including users who bought zero meals)

This query counts the meals bought by each user, but it only returns users who bought at least one meal:

```sql
SELECT userid, COUNT(*) as ct 
FROM events 
WHERE event='bought' 
GROUP BY userid
```


To include the users who bought zero meals, LEFT JOIN the `users` table to the previous result on `userid`, then use COALESCE to convert the resulting NULL counts to 0:

```sql
SELECT u.userid, COALESCE(c.ct,0) AS total_meals_bought FROM users AS u
LEFT JOIN (
        SELECT userid, COUNT(*) as ct 
        FROM events 
        WHERE event='bought' 
        GROUP BY userid ) as c 
        ON u.userid = c.userid
ORDER BY u.userid
LIMIT 20;
```
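To see the LEFT JOIN plus COALESCE pattern in isolation, here is a minimal sketch that runs the same query shape against an in-memory SQLite database. The three users and three events are hypothetical toy rows, not the real readychef data, and SQLite stands in for Postgres only because it ships with Python.

```python
import sqlite3

# Hypothetical toy data (NOT the real readychef tables).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (userid INTEGER);
    CREATE TABLE events (userid INTEGER, meal_id INTEGER, event TEXT);
    INSERT INTO users VALUES (1), (2), (3);
    INSERT INTO events VALUES
        (1, 18, 'bought'),
        (1, 29, 'bought'),
        (2, 1, 'like');
""")

query = """
    SELECT u.userid, COALESCE(c.ct, 0) AS total_meals_bought
    FROM users AS u
    LEFT JOIN (
        SELECT userid, COUNT(*) AS ct
        FROM events
        WHERE event = 'bought'
        GROUP BY userid
    ) AS c ON u.userid = c.userid
    ORDER BY u.userid;
"""

# User 1 bought two meals; user 2 only liked one; user 3 has no events.
rows = conn.execute(query).fetchall()
print(rows)  # [(1, 2), (2, 0), (3, 0)]
conn.close()
```

Users 2 and 3 have no row in the subquery, so the LEFT JOIN gives them a NULL count, which COALESCE turns into 0.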


### ii: Total money spent by each user

```sql
SELECT u.userid, COALESCE(s.spent,0) AS total_spent FROM users AS u
LEFT JOIN (
        SELECT userid, SUM(m.price) as spent 
        FROM events AS e
        JOIN meals AS m
         ON e.meal_id = m.meal_id
        WHERE event='bought' 
        GROUP BY userid ) as s
        ON u.userid = s.userid
ORDER BY u.userid
LIMIT 20;
```

### iii: Total visits from each user

```sql
SELECT u.userid, COALESCE(v.ct,0) AS total_visits FROM users AS u
LEFT JOIN (
        SELECT userid, COUNT(*) as ct 
        FROM visits 
        GROUP BY userid ) as v 
        ON u.userid = v.userid
ORDER BY u.userid
LIMIT 20;
```

### iv: Average visits per month from each user

(hint: http://www.sqlines.com/postgresql/how-to/datediff )

```sql

WITH user_life AS
(
    SELECT userid,
      ((DATE_PART('year', a.max_dt) - DATE_PART('year', a.min_dt)) * 12 +
      (DATE_PART('month', a.max_dt) - DATE_PART('month', a.min_dt)) + 1) AS user_months
     FROM
        (
        SELECT userid, MIN(dt) as min_dt, MAX(dt) AS max_dt 
        FROM visits GROUP BY userid
        ) AS a
),
total AS
(
    SELECT u.userid, COALESCE(v.ct,0) AS total_visits FROM users AS u
    LEFT JOIN (
            SELECT userid, COUNT(*) as ct 
            FROM visits 
            GROUP BY userid ) as v 
         ON u.userid = v.userid
)
SELECT uu.userid, ROUND(tt.total_visits / uu.user_months::numeric, 2) AS avg_visits_per_month
FROM user_life AS uu
JOIN total AS tt
ON uu.userid = tt.userid
LIMIT 10;
```
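The month arithmetic in `user_life` counts calendar months inclusively, from the month of the first visit through the month of the last. A small Python sketch of the same calculation may make it easier to check by hand; the function names and the visit dates are hypothetical, chosen only for illustration.

```python
from datetime import date

def months_active(first_visit, last_visit):
    """Inclusive count of calendar months between two dates,
    mirroring the DATE_PART arithmetic in the query above."""
    return ((last_visit.year - first_visit.year) * 12
            + (last_visit.month - first_visit.month) + 1)

def avg_visits_per_month(visit_dates):
    """Total visits divided by the number of active months, rounded to 2 places."""
    n_months = months_active(min(visit_dates), max(visit_dates))
    return round(len(visit_dates) / n_months, 2)

# Hypothetical visits: January through March 2013 is 3 months, inclusive.
visits = [date(2013, 1, 1), date(2013, 1, 15), date(2013, 3, 2)]
print(months_active(min(visits), max(visits)))  # 3
print(avg_visits_per_month(visits))             # 1.0
```

Note that the `+ 1` means a user whose visits all fall in a single month is counted as active for one month, which avoids dividing by zero.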

# 2. Web scraping

```python
import requests
from bs4 import BeautifulSoup

query = 'data scientist'

url = "http://www.indeed.com/jobs?q={}".format(query.replace(' ', '+'))
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')

url
```

```
'http://www.indeed.com/jobs?q=data+scientist'
```

### Find the number of results

You can search by tag, by attribute, or by both:

```python
soup.find('div', id='searchCount').text
```

```
'\n        Jobs 1 to 10 of 23,550'
```

```python
soup.find('div', id='searchCount')
```

```
<div id="searchCount">
        Jobs 1 to 10 of 23,550</div>
```

```python
soup.find(id='searchCount')
```

```
<div id="searchCount">
        Jobs 1 to 10 of 23,550</div>
```

```python
soup.find(id='searchCount').text
```

```
'\n        Jobs 1 to 10 of 23,550'
```

```python
soup.find(id='searchCount').text.split()
```

```
['Jobs', '1', 'to', '10', 'of', '23,550']
```

```python
soup.find(id='searchCount').text.split()[-1]
```

```
'23,550'
```

```python
import requests
from bs4 import BeautifulSoup

def number_of_jobs(query):
    '''
    INPUT: string
    OUTPUT: int

    Return the number of jobs on indeed.com for the search query.
    '''
    # Build the search URL, replacing spaces with '+'.
    url = "http://www.indeed.com/jobs?q={}".format(query.replace(' ', '+'))
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'html.parser')
    # The result count lives in the div with id "searchCount".
    search_count = soup.find('div', id='searchCount')
    # Take the text after "of ", drop the thousands separator, and convert.
    return int(search_count.text.split('of ')[-1].replace(',', ''))
```

```python
number_of_jobs('data scientist')
```

```
23550
```
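`number_of_jobs` assumes the `searchCount` div is present and that its text ends in "of N". If the page layout changes or a query has no results, that may not hold. Below is a minimal sketch of a more defensive parser for the count string; `parse_job_count` is a hypothetical helper name, and the regular expression assumes the "... of 23,550" format shown above.

```python
import re

def parse_job_count(text):
    """Extract the total from text like 'Jobs 1 to 10 of 23,550'.

    Returns an int, or None if the expected 'of N' pattern is missing.
    (Hypothetical helper; the pattern is an assumption based on the
    output shown above.)
    """
    match = re.search(r'of\s+([\d,]+)', text)
    if match is None:
        return None
    # Drop the thousands separators before converting.
    return int(match.group(1).replace(',', ''))

print(parse_job_count('\n        Jobs 1 to 10 of 23,550'))  # 23550
print(parse_job_count('no results found'))                   # None
```

Returning `None` rather than raising lets the caller decide how to handle a missing count.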

```python
soup.find_all('div', class_='row result')
```

```
[<div class="row result" data-jk="18f56cbc9a1cee95" id="pj_18f56cbc9a1cee95">
 <!-- Previously this variable was used to indicate job board jobs, we have replaced that with a more accurate source type check -->
 <a class="jobtitle turnstileLink" data-tn-element="jobTitle" href="/pagead/clk?mo=r&amp;ad=-6NYlbfkN0AKvx1StreRZ4gELVY3ypo3Yc-9qYqTvUfWWDCFukc8LAOVmHxGSbg-Ag6ukb0J2CCuffWw7MhJ6VKNPLLyhyY0hAEcDz4Ml4h7KXZMFRlhAZ8K6LVVsCKKy7wjMPVX177fkFACIuRHiVJtCZfSc9EG0w1TeGCXj3PFZ8fPkjeesCIOmZPZ3UJ5pClDsbiOQfe3m5ZkGLdwUedVehTzaj_W0PinLhYvj63HfRm-1lEgQ1lYJJZJs-_otbWkEOzi3LmCW-aeZjwK-zvT_b060uP31Ac14FUACrpaccp5L53ATgZEFumM6I1xbjDL3NVJwZ7T6w21Mlp9RGkP2XJ3fJoNs6da10X6eQJIvvdE9h4KV61mVW60NmDK5an4OhoDSluR07FUHNyfnuGNHDHhBM4WUJDXQ0eJvHQ8oM3IVVT0_A==&amp;p=1&amp;sk=&amp;fvj=0" id="sja1" onclick="setRefineByCookie([]); sjoc('sja1',0); convCtr('SJ')" onmousedown="sjomd('sja1'); clk('sja1');" rel="noopener nofollow" target="_blank" title="Data Scientist - Machine Learning"><b>Data</b> <b>Scientist</b> - M
```

# 3. Distributions

http://www.cs.elte.hu/~mesti/valszam/kepletek.pdf

### Beta distribution:
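$\mathrm{Beta}(\alpha, \beta)$

Starting from a uniform $\mathrm{Beta}(1, 1)$ prior, the parameters after observing the data are:

$\alpha = 1 + (\#\text{successes})$

$\beta = 1 + (\#\text{failures})$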

http://stats.stackexchange.com/a/47782
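As a quick check on the parameter rule above, here is a minimal sketch using only the standard library. It assumes a uniform Beta(1, 1) prior; the counts (30 successes, 70 failures) and the helper names are hypothetical.

```python
import random

def beta_params(successes, failures):
    """alpha and beta after observing the data, assuming a uniform Beta(1, 1) prior."""
    return 1 + successes, 1 + failures

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Hypothetical data: 30 successes and 70 failures.
alpha, beta = beta_params(successes=30, failures=70)
print(alpha, beta)  # 31 71

mean = beta_mean(alpha, beta)  # 31 / 102

# Draw samples to confirm the analytic mean empirically.
random.seed(0)
samples = [random.betavariate(alpha, beta) for _ in range(10000)]
sample_mean = sum(samples) / len(samples)
print(round(mean, 3))  # 0.304
```

With this many observations the mean sits close to the raw success rate of 0.30; with only a few observations it would be pulled toward 0.5 by the prior.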

# 4. Hypothesis testing

https://github.com/gschool/DSI_Lectures/blob/master/ab-testing/tammy_lee/lecture.pdf