In [1]:
%matplotlib inline
import matplotlib
import seaborn as sns
matplotlib.rcParams['savefig.dpi'] = 144

In [2]:
import grader

# SQL Miniproject

## Introduction

The city of New York inspects restaurants and assigns each one a grade. Inspection data for the last 4 years are available on S3 [here](s3://dataincubator-course/coursedata/nyc_inspection_data.zip). You can copy it from S3 with this command:

`!aws s3 cp s3://dataincubator-course/coursedata/nyc_inspection_data.zip .`

The file `RI_Webextract_BigApps_Latest.xls` contains a description of each of the data files.  Take a look and then load the CSV-formatted `*.txt` files into
a database as five tables:
1. `actions`
2. `cuisines`
3. `violations`
4. `grades` (from `WebExtract.txt`)
5. `boroughs` (from `RI_Webextract_BigApps_Latest.xls`)

In [3]:
!aws s3 cp s3://dataincubator-course/coursedata/nyc_inspection_data.zip .

download: s3://dataincubator-course/coursedata/nyc_inspection_data.zip to ./nyc_inspection_data.zip


## SQLite3

The project should be written in SQL. Between SQLite and PostgreSQL, we recommend SQLite (`sqlite3`) for this project.  You can use the sqlite command prompt by running this command in bash:
```bash
sqlite3 cmd "DROP TABLE IF EXISTS writer;\
CREATE TABLE IF NOT EXISTS writer (first_name, last_name, year);\
INSERT INTO writer VALUES ('William', 'Shakespeare', 1616);\
INSERT INTO writer VALUES ('Francis', 'Fitzgerald', 1896);\
\
SELECT * FROM writer;\
"
```
Alternatively, you can run bash commands in a Jupyter notebook by prepending `!` to the command in a code cell (notice that we conveniently get the output displayed):

In [4]:
!sqlite3 cmd """\
DROP TABLE IF EXISTS writer;\
CREATE TABLE IF NOT EXISTS writer (first_name, last_name, year);\
INSERT INTO writer VALUES ('William', 'Shakespeare', 1616);\
INSERT INTO writer VALUES ('Francis', 'Fitzgerald', 1896);\
\
SELECT * FROM writer;\
"""

William|Shakespeare|1616
Francis|Fitzgerald|1896


Finally, we can use the [ipython-sql extension](https://github.com/catherinedevlin/ipython-sql#ipython-sql) by first loading the sql extension and then running our code with the "magic" command in the first line of the cell:
```python
%%sql sqlite://
```
Notice that the output is rendered as a nicely formatted HTML table.

In [5]:
%load_ext sql

  warn("IPython.utils.traitlets has moved to a top-level traitlets package.")


In [6]:
%%sql sqlite://
DROP TABLE IF EXISTS writer;
CREATE TABLE IF NOT EXISTS writer (first_name, last_name, year);
INSERT INTO writer VALUES ('William', 'Shakespeare', 1616);
INSERT INTO writer VALUES ('Francis', 'Fitzgerald', 1896);

SELECT * FROM writer;

Done.
Done.
1 rows affected.
1 rows affected.
Done.


first_name,last_name,year
William,Shakespeare,1616
Francis,Fitzgerald,1896


## Loading data


The SQLite3 command-line shell has a convenient [`.import` command](https://sqlite.org/cli.html#csv_import) which can create tables from `.csv` files.

```bash
sqlite> .import sample.csv.nogit sample
sqlite> SELECT * FROM sample;
```

The files may contain malformed text.  Unfortunately, this is all too common.  As a stopgap, remember that [`iconv`](https://linux.die.net/man/1/iconv) is a Unix utility that can convert files between different text encodings.

Alternatively, you can read CSV files using pandas and persist the resulting DataFrame as a SQL table with the `%sql PERSIST` magic.
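
If neither the `.import` command nor pandas is at hand, the same kind of load can be sketched with only the Python standard library. This is a minimal illustration written for Python 3, not part of the assignment: the in-memory database and the inline CSV text are assumptions standing in for `sample.csv.nogit`.

```python
import csv
import io
import sqlite3

# Inline CSV text stands in for the contents of sample.csv.nogit (assumption).
csv_text = "Name,Age\nAlice,3\nBob,10\n"

reader = csv.reader(io.StringIO(csv_text))
header = next(reader)               # first line holds the column names
rows = [tuple(r) for r in reader]   # remaining lines are data rows (all strings)

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE sample (Name TEXT, Age INTEGER)")
# Parameter placeholders avoid building SQL by string concatenation.
conn.executemany("INSERT INTO sample VALUES (?, ?)", rows)

loaded = conn.execute("SELECT Name, Age FROM sample ORDER BY Age").fetchall()
print(header, loaded)
```

Because `Age` is declared `INTEGER`, SQLite converts the numeric-looking strings from the CSV into integers on insert.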

Also be aware that the `WebExtract.txt` file contains duplicated data. Multiple rows with identical `CAMIS` and `INSPDATE` values should be reduced to a single row. You will need the `SCORE`, `ZIPCODE`, `BORO`, and `CURRENTGRADE` columns for this miniproject. Make sure that you use a non-null value from the multiple rows for each of these columns when reducing to a single row.
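
One way to collapse the duplicates is to group by `CAMIS` and `INSPDATE` and take an aggregate of every other column. SQLite's aggregate `MAX` skips NULLs, so it returns a non-null value whenever the group has one. The sketch below (Python 3, standard library only) uses made-up rows; the table name `raw` and all values are illustrative assumptions, not taken from the real file.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw (CAMIS INTEGER, INSPDATE TEXT, SCORE REAL, "
    "ZIPCODE TEXT, BORO INTEGER, CURRENTGRADE TEXT)"
)
rows = [
    (1, "2014-01-01", None, "10001", 1, None),   # duplicate pair: each row
    (1, "2014-01-01", 12.0, None, 1, "A"),       # is missing different fields
    (2, "2014-02-01", None, "11201", 3, None),   # score missing everywhere
]
conn.executemany("INSERT INTO raw VALUES (?, ?, ?, ?, ?, ?)", rows)

# One output row per (CAMIS, INSPDATE); MAX picks a non-null value if any exists.
deduped = conn.execute("""
    SELECT CAMIS, INSPDATE, MAX(SCORE), MAX(ZIPCODE), MAX(BORO), MAX(CURRENTGRADE)
    FROM raw
    GROUP BY CAMIS, INSPDATE
    ORDER BY CAMIS
""").fetchall()
print(deduped)
```

If a group holds two different non-null values for a column, `MAX` silently keeps the larger one, so it is worth checking whether that ever happens in the real data.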

In [7]:
!printf "Name,Age\nAlice,3\nBob,10" > sample.csv.nogit

In [8]:
import pandas as pd
sample = pd.read_csv('sample.csv.nogit')
%sql DROP TABLE IF EXISTS sample
%sql PERSIST sample
%sql SELECT * FROM sample;

Done.
Done.


index,Name,Age
0,Alice,3
1,Bob,10


## For our data, load with encoding 'latin-1' or 'windows-1252'

1. actions
2. cuisines
3. violations
4. grades (from WebExtract.txt)
5. boroughs (from RI_Webextract_BigApps_Latest.xls)
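
To see why the encoding matters, consider a cuisine name with an accented character. The following sketch (Python 3) is only an illustration; the byte string is an assumed example of how such a name might be stored in a latin-1 encoded file.

```python
# Bytes as they might appear in a latin-1 encoded file (illustrative assumption).
raw = b"Caf\xe9/Coffee/Tea"

# Decoding with the right codec recovers the accented character.
text = raw.decode("latin-1")

# For this byte, windows-1252 agrees with latin-1.
same = raw.decode("windows-1252") == text

# The same bytes are not valid UTF-8, which is why a default UTF-8 read fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(text, same, utf8_ok)
```

Passing `encoding='latin-1'` to `pd.read_csv` applies the same decoding to the whole file.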


In [9]:
grades = pd.read_csv('WebExtract.txt', encoding = 'latin-1')
%sql DROP TABLE IF EXISTS grades
%sql PERSIST grades

  interactivity=interactivity, compiler=compiler, result=result)


Done.


u'Persisted grades'

In [10]:
%sql SELECT * FROM grades LIMIT 5;

Done.


index,CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,PHONE,CUISINECODE,INSPDATE,ACTION,VIOLCODE,SCORE,CURRENTGRADE,GRADEDATE,RECORDDATE
0,30075445,MORRIS PARK BAKE SHOP,2,1007,MORRIS PARK AVE,10462.0,7188924968,8,2014-03-03 00:00:00,D,10F,2.0,A,2014-03-03 00:00:00,2014-09-04 06:01:28.403000000
1,30112340,WENDY'S,3,469,FLATBUSH AVENUE,11225.0,7182875005,39,2014-07-01 00:00:00,F,06A,23.0,B,2014-07-01 00:00:00,2014-09-04 06:01:28.403000000
2,30191841,DJ REYNOLDS PUB AND RESTAURANT,1,351,WEST 57 STREET,10019.0,2122452912,3,2013-07-22 00:00:00,D,10B,11.0,A,2013-07-22 00:00:00,2014-09-04 06:01:28.403000000
3,40356483,WILKEN'S FINE FOOD,3,7114,AVENUE U,11234.0,7184443838,27,2014-05-29 00:00:00,D,08C,10.0,A,2014-05-29 00:00:00,2014-09-04 06:01:28.403000000
4,30191841,DJ REYNOLDS PUB AND RESTAURANT,1,351,WEST 57 STREET,10019.0,2122452912,3,2013-07-22 00:00:00,D,02G,11.0,A,2013-07-22 00:00:00,2014-09-04 06:01:28.403000000


## Question 1: null_entries

Return the number of entries in the grades table with a blank score. Remove those rows from the dataset for the rest of the questions in the assignment.

**Question:** How else might we have handled this?
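
Counting the blank scores and then dropping those rows can be sketched as follows (Python 3, standard library only). The tiny `grades` table here is a made-up stand-in for the real one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grades (CAMIS INTEGER, SCORE REAL)")
conn.executemany(
    "INSERT INTO grades VALUES (?, ?)",
    [(1, 12.0), (2, None), (3, 7.0), (4, None)],
)

# NULL never compares equal to anything, so test with IS NULL, not = NULL.
null_count = conn.execute(
    "SELECT COUNT(*) FROM grades WHERE SCORE IS NULL"
).fetchone()[0]

# Remove the blank-score rows so later averages are not affected by them.
conn.execute("DELETE FROM grades WHERE SCORE IS NULL")
remaining = conn.execute("SELECT COUNT(*) FROM grades").fetchone()[0]
print(null_count, remaining)
```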

In [11]:
%%sql SELECT COUNT(*) FROM grades
WHERE score is NULL

Done.


COUNT(*)
33524


In [24]:
%sql DROP TABLE IF EXISTS grad
%sql CREATE TABLE grad AS SELECT CAMIS, INSPDATE, SCORE, ZIPCODE, BORO, CURRENTGRADE FROM grades

Done.
Done.


[]

In [48]:
%sql DROP TABLE IF EXISTS grad1

Done.


[]

In [49]:
%%sql CREATE TABLE grad1 AS
SELECT CAMIS, INSPDATE,  MAX(SCORE) AS SCORE,  ZIPCODE, BORO, CURRENTGRADE
FROM grad
GROUP BY CAMIS, INSPDATE

Done.


[]

In [50]:
%sql SELECT COUNT(*) FROM grad1 WHERE score is NULL

Done.


COUNT(*)
8255


In [47]:
def null_entries():
    return 8255

grader.score('sql__null_entries', null_entries)

Your score:  1


In [51]:
%%sql UPDATE grad1
SET SCORE = 0
WHERE SCORE is NULL;

8255 rows affected.


[]

In [52]:
%sql select * from grad1 limit 5

Done.


CAMIS,INSPDATE,SCORE,ZIPCODE,BORO,CURRENTGRADE
30075445,2011-03-10 00:00:00,14.0,10462.0,2,B
30075445,2011-04-27 00:00:00,0.0,10462.0,2,
30075445,2011-11-12 00:00:00,0.0,10462.0,2,
30075445,2011-11-23 00:00:00,9.0,10462.0,2,A
30075445,2011-12-21 00:00:00,0.0,10462.0,2,


## Question 2: score_by_zipcode

Return a list of tuples of the form:

    (zipcode, mean score, number of restaurants)

for each of the 92 zipcodes in the city with over 100 restaurants. Use the score from the latest inspection date for each restaurant. Sort the list in ascending order by mean score.

**Note:** There is an interesting discussion here about what the mean score *means* in this dataset. Think about what we're actually calculating - does it represent what we're trying to understand about these zipcodes?

What if we use the average of a restaurant's inspections instead of the latest?

**Checkpoints:**
- Total unique restaurants: 25,232;
- Total restaurants in valid zipcodes: 20,349
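
One portable way to pick each restaurant's latest inspection is to join the table against a subquery holding the maximum `INSPDATE` per `CAMIS`, and only then average by zipcode. The sketch below (Python 3, standard library only) assumes the table already has one row per restaurant and inspection date; the table name `insp`, the rows, and the lowered cutoff are all illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE insp (CAMIS INTEGER, INSPDATE TEXT, SCORE REAL, ZIPCODE TEXT)"
)
conn.executemany("INSERT INTO insp VALUES (?, ?, ?, ?)", [
    (1, "2013-01-01", 30.0, "10001"),   # older inspection, should be ignored
    (1, "2014-01-01", 10.0, "10001"),   # latest for restaurant 1
    (2, "2014-03-01", 20.0, "10001"),   # latest for restaurant 2
    (3, "2014-02-01", 5.0, "11201"),    # only restaurant in 11201
])

MIN_RESTAURANTS = 1  # the assignment uses 100; lowered here for the toy data

# ISO-formatted date strings sort chronologically, so MAX() finds the latest.
by_zip = conn.execute("""
    SELECT i.ZIPCODE, AVG(i.SCORE) AS mean_score, COUNT(*) AS n
    FROM insp AS i
    JOIN (SELECT CAMIS, MAX(INSPDATE) AS last_date
          FROM insp
          GROUP BY CAMIS) AS latest
      ON i.CAMIS = latest.CAMIS AND i.INSPDATE = latest.last_date
    GROUP BY i.ZIPCODE
    HAVING COUNT(*) > ?
    ORDER BY mean_score ASC
""", (MIN_RESTAURANTS,)).fetchall()
print(by_zip)
```

Zipcode `10001` averages the two latest scores (10 and 20), while `11201` is dropped because it does not exceed the cutoff.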

In [104]:
%sql drop table if exists grad2

Done.


[]

In [105]:
%%sql CREATE TABLE grad2 AS
SELECT CAMIS, MAX(INSPDATE) as inspdate,  SCORE,  ZIPCODE
FROM grad1
GROUP BY CAMIS

Done.


[]

In [106]:
%sql select * from grad2 limit 5

Done.


CAMIS,inspdate,SCORE,ZIPCODE
30075445,2014-03-03 00:00:00,2.0,10462.0
30112340,2014-07-01 00:00:00,23.0,11225.0
30191841,2013-07-22 00:00:00,11.0,10019.0
40356018,2014-06-10 00:00:00,5.0,11224.0
40356068,2014-01-29 00:00:00,12.0,11374.0


In [107]:
%sql drop table if exists grad2a

Done.


[]

In [108]:
%%sql create table grad2a as
select zipcode, avg(score) as meanscore, count(distinct camis) as ncam from grad2 
where zipcode is not null
group by zipcode
order by meanscore asc;

Done.


[]

In [109]:
my = %sql SELECT * FROM grad2a WHERE ncam > 100

Done.


In [110]:
print len(my)
mylist = list(my)
# pandas stored the zipcodes as floats (e.g. 10001.0); convert them back to
# zero-padded 5-character strings.
result = [(str(int(a)).zfill(5), b, c) for a, b, c in mylist]
print result[:5]

92
[('10001', 8.697445972495089, 509), ('10451', 8.78343949044586, 157), ('10452', 8.841584158415841, 101), ('10004', 8.992957746478874, 142), ('10007', 9.23021582733813, 139)]


In [116]:
for r in result:
    for i in r:
        print i,',',
    print
    

10001 , 8.6974459725 , 509 ,
10451 , 8.78343949045 , 157 ,
10452 , 8.84158415842 , 101 ,
10004 , 8.99295774648 , 142 ,
10007 , 9.23021582734 , 139 ,
11236 , 9.24324324324 , 111 ,
11234 , 9.39473684211 , 152 ,
11430 , 9.49324324324 , 148 ,
11207 , 9.625 , 136 ,
11209 , 9.6553030303 , 264 ,
11231 , 9.66037735849 , 159 ,
11238 , 9.66260162602 , 246 ,
11217 , 9.68897637795 , 254 ,
10472 , 9.74074074074 , 108 ,
10306 , 9.74774774775 , 111 ,
11101 , 9.752 , 250 ,
11218 , 9.80272108844 , 147 ,
11201 , 9.81739130435 , 345 ,
11369 , 9.82692307692 , 104 ,
10301 , 9.95145631068 , 103 ,
10468 , 9.96261682243 , 107 ,
10065 , 9.97687861272 , 173 ,
10461 , 9.98701298701 , 154 ,
11222 , 9.99481865285 , 193 ,
11368 , 10.0448275862 , 290 ,
10023 , 10.0710659898 , 197 ,
11105 , 10.1138211382 , 123 ,
11225 , 10.1782178218 , 101 ,
11237 , 10.2346368715 , 179 ,
11361 , 10.2857142857 , 119 ,
10458 , 10.2908163265 , 196 ,
10019 , 10.3151515152 , 660 ,
10462 , 10.3243243243 , 148 ,
11103 , 10.3636363636 , 209 

In [112]:
def score_by_zipcode():
    return result
#[("11201", 9.81739130434783, 345)] * 92

grader.score('sql__score_by_zipcode', score_by_zipcode)

Your score:  1.0


## Question 3: score_by_map

The numbers above are not terribly enlightening.  Use [CartoDB](http://cartodb.com/) to produce a map of average scores by zip code.  You can sign up for a free trial.

You will have to use their wizard to plot the data by [zipcode](https://carto.com/learn/guides/analysis/georeference). You will need to specify "USA" in the country field.  Then use the "share" button to return a link of the form [https://x.cartodb.com/](https://x.cartodb.com/).

**For fun:** How do JFK, Brighton Beach, Liberty Island (home of the Statue of Liberty), Financial District, Chinatown, and Coney Island fare?

**For more fun:** Plot restaurants as pins on the map, allowing the user to filter by "low", "middling", or "high"-scoring restaurants. You can use a CASE WHEN statement to create the different groups based on score thresholds.
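
A `CASE WHEN` expression for the grouping might look like the sketch below (Python 3, standard library only). The table name `latest` and the cutoffs 13 and 27 are illustrative assumptions; choose whatever thresholds make sense for your map.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE latest (CAMIS INTEGER, SCORE REAL)")
conn.executemany(
    "INSERT INTO latest VALUES (?, ?)",
    [(1, 5.0), (2, 20.0), (3, 40.0)],
)

# Branches are tested in order, so the second WHEN only sees scores above 13.
# The cutoffs 13 and 27 are assumptions chosen for illustration.
labeled = conn.execute("""
    SELECT CAMIS,
           CASE WHEN SCORE <= 13 THEN 'low'
                WHEN SCORE <= 27 THEN 'middling'
                ELSE 'high'
           END AS bucket
    FROM latest
    ORDER BY CAMIS
""").fetchall()
print(labeled)
```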

In [117]:
def score_by_map():
    # must be url of the form https://x.cartodb.com/...
    return "https://xiangscode.carto.com/builder/f944d7f1-453b-4332-914e-8ca5db3bdd43/embed"#"https://cartodb.com"

grader.score('sql__score_by_map', score_by_map)

Your score:  1.0


## Question 4: score_by_borough
Return a list of tuples of the form:

    (borough, mean score, number of restaurants)

for each of the city's five boroughs. Sort the list in ascending order by mean score.

**Hint:** You will have to perform a join with the `boroughs` table. The borough names should be reported in ALL CAPS.

**Checkpoint:**
- Total restaurants in valid boroughs: 25,220
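
The join itself can be sketched on toy data as follows (Python 3, standard library only). The table `latest`, the rows, and the mixed-case names are made-up assumptions used to show the effect of `UPPER` and of the inner join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE latest (CAMIS INTEGER, BORO INTEGER, SCORE REAL)")
conn.execute("CREATE TABLE boroughs (BORO INTEGER, BORONAME TEXT)")
conn.executemany("INSERT INTO latest VALUES (?, ?, ?)", [
    (1, 1, 10.0),
    (2, 1, 14.0),
    (3, 3, 8.0),
    (4, 0, 50.0),   # borough code with no match in the lookup table
])
conn.executemany(
    "INSERT INTO boroughs VALUES (?, ?)",
    [(1, "Manhattan"), (3, "Brooklyn")],
)

# An inner JOIN keeps only restaurants whose borough code has a match.
by_boro = conn.execute("""
    SELECT UPPER(b.BORONAME), AVG(l.SCORE) AS mean_score, COUNT(*) AS n
    FROM latest AS l
    JOIN boroughs AS b ON l.BORO = b.BORO
    GROUP BY b.BORONAME
    ORDER BY mean_score ASC
""").fetchall()
print(by_boro)
```

The restaurant with the unmatched code `0` drops out of the result, which is what separates "restaurants in valid boroughs" from the total.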

In [118]:
%sql select * from grad1 limit 10

Done.


CAMIS,INSPDATE,SCORE,ZIPCODE,BORO,CURRENTGRADE
30075445,2011-03-10 00:00:00,14.0,10462.0,2,B
30075445,2011-04-27 00:00:00,0.0,10462.0,2,
30075445,2011-11-12 00:00:00,0.0,10462.0,2,
30075445,2011-11-23 00:00:00,9.0,10462.0,2,A
30075445,2011-12-21 00:00:00,0.0,10462.0,2,
30075445,2012-05-03 00:00:00,0.0,10462.0,2,
30075445,2012-12-31 00:00:00,25.0,10462.0,2,
30075445,2013-01-24 00:00:00,10.0,10462.0,2,A
30075445,2013-06-01 00:00:00,0.0,10462.0,2,
30075445,2013-08-14 00:00:00,32.0,10462.0,2,


In [138]:
%%sql
DROP TABLE IF EXISTS boroughs;
CREATE TABLE boroughs (
    IDX               INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    BORO              TEXT NOT NULL,
    BORONAME          TEXT NOT NULL
);

Done.
Done.


[]

In [139]:
%%sql

INSERT INTO boroughs (BORO, BORONAME)
    VALUES ('1', 'MANHATTAN');

INSERT INTO boroughs (BORO, BORONAME)
    VALUES ('2', 'THE BRONX');

INSERT INTO boroughs (BORO, BORONAME)
    VALUES ('3', 'BROOKLYN');

INSERT INTO boroughs (BORO, BORONAME)
    VALUES ('4', 'QUEENS');

INSERT INTO boroughs (BORO, BORONAME)
    VALUES ('5', 'STATEN ISLAND');
    

1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.


[]

In [140]:
%sql select * from boroughs

Done.


IDX,BORO,BORONAME
1,1,MANHATTAN
2,2,THE BRONX
3,3,BROOKLYN
4,4,QUEENS
5,5,STATEN ISLAND


In [150]:
%sql DROP TABLE IF EXISTS grad3

Done.


[]

In [151]:
%%sql CREATE TABLE grad3 AS
SELECT grad1.BORO as BOROID, b.BORONAME as BORONAME, grad1.SCORE as SCORE, 
grad1.CAMIS as CAMIS, max(grad1.INSPDATE) as INSPDATE, grad1.CURRENTGRADE as GRADE
FROM grad1 JOIN boroughs as b ON grad1.BORO = b.BORO
GROUP BY CAMIS

Done.


[]

In [152]:
%sql select * from grad3 limit 10

Done.


BOROID,BORONAME,SCORE,CAMIS,INSPDATE,GRADE
2,THE BRONX,2.0,30075445,2014-03-03 00:00:00,A
3,BROOKLYN,23.0,30112340,2014-07-01 00:00:00,B
1,MANHATTAN,11.0,30191841,2013-07-22 00:00:00,A
3,BROOKLYN,5.0,40356018,2014-06-10 00:00:00,A
4,QUEENS,12.0,40356068,2014-01-29 00:00:00,
4,QUEENS,10.0,40356151,2014-05-02 00:00:00,A
5,STATEN ISLAND,12.0,40356442,2014-05-20 00:00:00,A
3,BROOKLYN,10.0,40356483,2014-05-29 00:00:00,A
3,BROOKLYN,12.0,40356649,2014-07-18 00:00:00,A
3,BROOKLYN,12.0,40356731,2014-07-14 00:00:00,A


In [154]:
q4 = %sql select BORONAME, AVG(SCORE) as SC, COUNT(CAMIS) FROM grad3 GROUP BY BORONAME ORDER BY SC ASC

Done.


In [155]:
lq4 =[x for x in q4]
ans4 = [ (a, b, c) for a, b, c in lq4]
print ans4

[(u'THE BRONX', 10.069767441860465, 2365), (u'BROOKLYN', 10.702075098814229, 6072), (u'MANHATTAN', 10.721693951573375, 10201), (u'STATEN ISLAND', 10.74712643678161, 957), (u'QUEENS', 11.059733333333334, 5625)]


In [156]:
def score_by_borough():
    return ans4#[("MANHATTAN", 10.7269875502402, 10201)] * 5

grader.score('sql__score_by_borough', score_by_borough)

Your score:  1.0


## Question 5: score_by_cuisine

Return a list of the 75 tuples of the form

    (cuisine, mean score, number of reports)

for each of the 75 cuisine types with at least 100 violation reports. Sort the list in ascending order by score. Are the least sanitary and most sanitary
cuisine types surprising?

**Note:** It's interesting to think again about what this analysis is trying to say and how it differs from the analysis by zipcode. How should this
affect the calculation in your opinion?

**Checkpoint:**
- Total entries from valid cuisines: 531,529
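
To make the difference from the zipcode analysis concrete, the sketch below (Python 3, standard library only) averages over every violation report rather than over each restaurant's latest inspection. The table name `reports`, the rows, and the lowered cutoff are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reports (CAMIS INTEGER, CUISINECODE INTEGER, SCORE REAL)")
conn.execute("CREATE TABLE cuisine (CUISINECODE INTEGER, CODEDESC TEXT)")
conn.executemany("INSERT INTO reports VALUES (?, ?, ?)", [
    (1, 3, 10.0), (1, 3, 10.0), (1, 3, 10.0),  # one restaurant, three reports
    (2, 3, 30.0),                              # another restaurant, one report
    (3, 2, 5.0),                               # too few reports for the cutoff
])
conn.executemany(
    "INSERT INTO cuisine VALUES (?, ?)",
    [(2, "African"), (3, "American")],
)

MIN_REPORTS = 2  # the assignment uses 100; lowered here for the toy data

by_cuisine = conn.execute("""
    SELECT c.CODEDESC, AVG(r.SCORE) AS mean_score, COUNT(*) AS n_reports
    FROM reports AS r
    JOIN cuisine AS c ON r.CUISINECODE = c.CUISINECODE
    GROUP BY c.CODEDESC
    HAVING COUNT(*) >= ?
    ORDER BY mean_score ASC
""", (MIN_REPORTS,)).fetchall()
print(by_cuisine)
```

Here the per-report mean is 15.0 because restaurant 1 is counted three times; averaging one score per restaurant would give 20.0 instead.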

In [157]:
cuisine = pd.read_csv('Cuisine.txt', encoding = 'latin-1')
%sql DROP TABLE IF EXISTS cuisine
%sql PERSIST cuisine

Done.


u'Persisted cuisine'

In [158]:
%sql select * from cuisine limit 5

Done.


index,CUISINECODE,CODEDESC
0,2,African
1,3,American
2,5,Asian
3,15,Cajun
4,17,Caribbean


In [164]:
%sql select * from grades limit 2

Done.


index,CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,PHONE,CUISINECODE,INSPDATE,ACTION,VIOLCODE,SCORE,CURRENTGRADE,GRADEDATE,RECORDDATE
0,30075445,MORRIS PARK BAKE SHOP,2,1007,MORRIS PARK AVE,10462.0,7188924968,8,2014-03-03 00:00:00,D,10F,2.0,A,2014-03-03 00:00:00,2014-09-04 06:01:28.403000000
1,30112340,WENDY'S,3,469,FLATBUSH AVENUE,11225.0,7182875005,39,2014-07-01 00:00:00,F,06A,23.0,B,2014-07-01 00:00:00,2014-09-04 06:01:28.403000000


In [167]:
%sql SELECT COUNT(*) FROM grades WHERE score is NULL

Done.


COUNT(*)
33524


In [168]:
%%sql UPDATE grades
SET SCORE = 0
WHERE SCORE is NULL;

33524 rows affected.


[]

In [179]:
%sql DROP TABLE IF EXISTS grad5

Done.


[]

In [180]:
%%sql CREATE TABLE grad5 AS
SELECT  AVG(SCORE) as MEANSCORE, CUISINECODE, count(VIOLCODE) as NUMVIOL, CURRENTGRADE
FROM grades
GROUP BY CUISINECODE
HAVING NUMVIOL >= 100
ORDER BY MEANSCORE ASC

Done.


[]

In [181]:
%sql select count(*) from grad5 

Done.


count(*)
75


In [182]:
%sql select * from grad5 limit 5

Done.


MEANSCORE,CUISINECODE,NUMVIOL,CURRENTGRADE
11.0510607695,99,1881,
13.4821428571,42,145,A
14.2811447811,75,576,
14.63,33,297,A
14.6521941386,29,6303,


In [183]:
%%sql CREATE TABLE grad5a AS
SELECT c.CODEDESC as CUISINE, g.MEANSCORE as MEANSCORE, g.NUMVIOL as VIOL
FROM grad5 as g JOIN cuisine as c ON g.CUISINECODE = c.CUISINECODE

Done.


[]

In [186]:
%sql select * from grad5a ORDER BY MEANSCORE limit 5

Done.


CUISINE,MEANSCORE,VIOL
Other,11.0510607695,1881
Hotdogs/Pretzels,13.4821428571,145
Soups & Sandwiches,14.2811447811,576
Ethiopian,14.63,297
Donuts,14.6521941386,6303


In [187]:
q5 = %sql select * from grad5a ORDER BY MEANSCORE

Done.


In [188]:
lq5 =[x for x in q5]
ans5 = [ (a, b, c) for a, b, c in lq5]

In [189]:
ans5

[(u'Other', 11.051060769507371, 1881),
 (u'Hotdogs/Pretzels', 13.482142857142858, 145),
 (u'Soups & Sandwiches', 14.281144781144782, 576),
 (u'Ethiopian', 14.63, 297),
 (u'Donuts', 14.652194138626143, 6303),
 (u'Hotdogs', 14.71822033898305, 433),
 (u'Ice Cream, Gelato, Yogurt, Ices', 14.730250990752973, 3670),
 (u'Armenian', 15.079268292682928, 632),
 (u'Cajun', 15.548387096774194, 179),
 (u'Egyptian', 15.662125340599456, 348),
 (u'Sandwiches', 15.712999099369558, 6510),
 (u'Caf\xe9/Coffee/Tea', 15.994683295044306, 14860),
 (u'Juice, Smoothies, Fruit Salads', 16.003679175864605, 2638),
 (u'Sandwiches/Salads/Mixed Buffet', 16.340146917917597, 3057),
 (u'Hamburgers', 16.577529232333504, 7663),
 (u'Bottled beverages, including water, sodas, juices, etc.',
  16.74928774928775,
  1021),
 (u'English', 17.33625730994152, 334),
 (u'Middle Eastern', 17.706328375619716, 3313),
 (u'Chicken', 17.836702546600158, 7442),
 (u'Southwestern', 18.004830917874397, 203),
 (u'German', 18.627986348122867, 8

In [190]:
def score_by_cuisine():
    return ans5#[("French", 20.3550686378036, 7576)] * 75

grader.score('sql__score_by_cuisine', score_by_cuisine)

Your score:  0.96


## Question 6: violation_by_cuisine
Which cuisines tend to have a disproportionate number of which violations? Answering this question isn't easy because you have to think carefully about normalizations.

1. More popular cuisine categories will tend to have more violations just because they represent more restaurants.
2. Similarly, some violations are more common.  For example, knowing that "Equipment not easily movable or sealed to floor" is a common violation for Chinese restaurants is not particularly helpful when it is a common violation for all restaurants.

The right quantity to look at is the conditional probability of a specific type of violation given a specific cuisine type, divided by the unconditional probability of that violation for the entire population. A ratio above 1 means the violation is more common for that cuisine than for restaurants overall.  Return the 20 highest ratios of the form:

    ((cuisine, violation), ratio, count)

**Hint:**
1. You might want to check out this [Stack Overflow post](http://stackoverflow.com/questions/972877/calculate-frequency-using-sql).
2. The definition of a violation changes with time.  For example, depending on whether the inspection predates 2003, 10A can mean either "Toilet facility not maintained ..." or "Vermin or other live animal present ...". To deal with this, you should limit your analysis to **violation codes with end date after Jan 1, 2014**. (This end date refers to the validity time ranges in Violation.txt.)
3. The ratios don't mean much when the **number of violations of a given type and for a specific category** is not large (why not?).  Be sure to filter these out.  We chose 100 as our cutoff.

**Checkpoint:**
- Top 20 ratios mean: 2.37009216349859
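
Written with counts, the ratio for cuisine c and violation v is (n_vc / n_c) / (n_v / N), where n_vc counts rows with both, n_c counts rows for the cuisine, n_v counts rows for the violation, and N is the total number of rows. The sketch below (Python 3, standard library only) computes it on toy data; the table name `events`, the rows, and the lowered cutoff are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (CUISINE TEXT, VIOLCODE TEXT)")
rows = ([("Thai", "04C")] * 3 + [("Thai", "10F")] * 1 +
        [("Pizza", "04C")] * 1 + [("Pizza", "10F")] * 3)
conn.executemany("INSERT INTO events VALUES (?, ?)", rows)

MIN_COUNT = 2  # the assignment's cutoff is 100; lowered for the toy data

# vc: count per (cuisine, violation); c: per cuisine; v: per violation; t: total.
ratios = conn.execute("""
    SELECT vc.CUISINE, vc.VIOLCODE,
           (vc.n * 1.0 / c.n) / (v.n * 1.0 / t.n) AS ratio,
           vc.n AS cnt
    FROM (SELECT CUISINE, VIOLCODE, COUNT(*) AS n
          FROM events GROUP BY CUISINE, VIOLCODE) AS vc
    JOIN (SELECT CUISINE, COUNT(*) AS n
          FROM events GROUP BY CUISINE) AS c ON vc.CUISINE = c.CUISINE
    JOIN (SELECT VIOLCODE, COUNT(*) AS n
          FROM events GROUP BY VIOLCODE) AS v ON vc.VIOLCODE = v.VIOLCODE
    CROSS JOIN (SELECT COUNT(*) AS n FROM events) AS t
    WHERE vc.n > ?
    ORDER BY ratio DESC, vc.CUISINE
""", (MIN_COUNT,)).fetchall()

# Reshape into the ((cuisine, violation), ratio, count) form asked for above.
result = [((cuisine, viol), ratio, cnt) for cuisine, viol, ratio, cnt in ratios]
print(result)
```

In this toy table each violation makes up half of all rows but three quarters of one cuisine's rows, so both surviving pairs get a ratio of 0.75 / 0.5 = 1.5.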

In [199]:
violation = pd.read_csv("Violation.txt", sep = '","|$"|"^')
violation['VIOLATIONDESC"'] = violation['VIOLATIONDESC"'].str.decode('windows-1252')
#violation = pd.read_csv('Violation.txt', encoding = 'latin-1')
%sql DROP TABLE IF EXISTS violation
%sql PERSIST violation

Done.


  """Entry point for launching an IPython kernel.


u'Persisted violation'

In [200]:
%sql select * from violation limit 5

Done.


index,"""STARTDATE",ENDDATE,CRITICALFLAG,VIOLATIONCODE,"VIOLATIONDESC"""
0,"""1901-01-01 00:00:00",2003-03-23 00:00:00,Y,01A,"Current valid <a onmouseover=""ShowContent('P2','01A'); return true;"" href=""javascript:ShowContent('P2','01A')"">permit</A> , registration or other authorization to operate establishment not available."""
1,"""2003-03-24 00:00:00",2005-02-17 00:00:00,Y,01A,"Current valid <a onmouseover=""ShowContent('P2','01A'); return true;"" href=""javascript:ShowContent('P2','01A')"">permit</A> , registration or other authorization to operate establishment not available."""
2,"""2005-02-18 00:00:00",2007-06-30 00:00:00,Y,01A,"Current valid <a onmouseover=""ShowContent('P2','01A'); return true;"" href=""javascript:ShowContent('P2','01A')"">permit</A> , registration or other authorization to operate establishment not available."""
3,"""2007-07-01 00:00:00",2008-06-30 00:00:00,Y,01A,"Current valid permit, registration or other authorization to operate establishment not available. Violations points are not assessed for Smoke Free Air Act, trans fat, calorie posting or permit and poster violations."""
4,"""2008-07-01 00:00:00",2009-08-01 00:00:00,Y,01A,"Current valid permit, registration or other authorization to operate establishment not available. Violations points are not assessed for Smoke Free Air Act, trans fat, calorie posting or permit and poster violations."""


In [201]:
%sql select * from grades limit 2

Done.


index,CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,PHONE,CUISINECODE,INSPDATE,ACTION,VIOLCODE,SCORE,CURRENTGRADE,GRADEDATE,RECORDDATE
0,30075445,MORRIS PARK BAKE SHOP,2,1007,MORRIS PARK AVE,10462.0,7188924968,8,2014-03-03 00:00:00,D,10F,2.0,A,2014-03-03 00:00:00,2014-09-04 06:01:28.403000000
1,30112340,WENDY'S,3,469,FLATBUSH AVENUE,11225.0,7182875005,39,2014-07-01 00:00:00,F,06A,23.0,B,2014-07-01 00:00:00,2014-09-04 06:01:28.403000000


In [204]:
%sql select * from cuisine limit 5

Done.


index,CUISINECODE,CODEDESC
0,2,African
1,3,American
2,5,Asian
3,15,Cajun
4,17,Caribbean


In [290]:
%sql DROP TABLE IF EXISTS grad6

Done.


[]

In [291]:
%%sql CREATE TABLE grad6 AS
SELECT g.CAMIS as CAMIS, g.CUISINECODE as CUISINECODE, g.VIOLCODE as VIOLATIONCODE, 
v.ENDDATE as ENDDATE, v.'VIOLATIONDESC"' as VIOLATIONDESC
FROM grades as g LEFT JOIN violation as v 
ON g.VIOLCODE=v.VIOLATIONCODE
WHERE v.ENDDATE > '2014-01-01'

Done.


[]

In [294]:
%sql select count(*) from grad6 where VIOLATIONCODE is not NULL 

Done.


count(*)
520714


In [251]:
%sql select * from grad6 limit 5

Done.


CAMIS,CUISINECODE,VIOLATIONCODE,ENDDATE,VIOLATIONDESC
40364467,3,02A,2099-12-31 00:00:00,"Food not cooked to required minimum temperature."""
40365361,3,02A,2099-12-31 00:00:00,"Food not cooked to required minimum temperature."""
40365942,20,02A,2099-12-31 00:00:00,"Food not cooked to required minimum temperature."""
40367544,77,02A,2099-12-31 00:00:00,"Food not cooked to required minimum temperature."""
40368318,3,02A,2099-12-31 00:00:00,"Food not cooked to required minimum temperature."""


In [266]:
%sql DROP TABLE IF EXISTS grad6c
%sql CREATE TABLE grad6c AS SELECT VIOLATIONCODE, COUNT(*) as COUNTV FROM grad6 GROUP BY VIOLATIONCODE

Done.
Done.


[]

In [283]:
%sql select * from grad6c limit 5

Done.


VIOLATIONCODE,COUNTV
02A,498
02B,24523
02C,550
02D,52
02E,19


In [295]:
%sql DROP TABLE IF EXISTS grad6a

Done.


[]

In [296]:
%%sql create table grad6a AS
Select A.VIOLATIONCODE as VIOLATIONCODE, A.VIOLATIONDESC as VIOLATIONDESC, A.CUISINECODE, A.COUNTVC * 1.0 / B.COUNTC As PROBVC, A.COUNTVC as COUNT
From    (
        Select VIOLATIONCODE, CUISINECODE, Count(*) As COUNTVC, VIOLATIONDESC
        From   grad6
        Group By VIOLATIONCODE, CUISINECODE
        Having COUNTVC > 100
        ) As A
        Inner Join (
            Select CUISINECODE, Count(*) As COUNTC
            From   grad6
            Group By CUISINECODE
            ) As B
            ON A.CUISINECODE = B.CUISINECODE

Done.


[]

In [297]:
%sql select * from grad6a limit 5

Done.


VIOLATIONCODE,VIOLATIONDESC,CUISINECODE,PROBVC,COUNT
02A,"Food not cooked to required minimum temperature.""",3,0.00103624464939,130
02B,"Hot food item not held at or above 140º F.""",2,0.0770482908302,142
02B,"Hot food item not held at or above 140º F.""",3,0.0305771882697,3836
02B,"Hot food item not held at or above 140º F.""",5,0.0556571727201,426
02B,"Hot food item not held at or above 140º F.""",8,0.0592546583851,954


In [298]:
%sql DROP TABLE IF EXISTS grad6b

Done.


[]

In [299]:
%%sql create table grad6b as
select a.VIOLATIONCODE, a.VIOLATIONDESC, a.CUISINECODE, (a.PROBVC /(x.COUNTV / 520714.0))  as RATIO, a.COUNT, c.CODEDESC
from grad6a as a 
JOIN cuisine as c ON a.CUISINECODE=c.CUISINECODE
JOIN grad6c as x ON a.VIOLATIONCODE=x.VIOLATIONCODE
order by RATIO DESC

Done.


[]

In [300]:
%sql select * from grad6b limit 5

Done.


VIOLATIONCODE,VIOLATIONDESC,CUISINECODE,RATIO,COUNT,CODEDESC
04C,"Food worker does not use proper utensil to eliminate bare hand contact with food that will not receive adequate additional heat treatment.""",49,3.24413628229,541,Japanese
20D,"“Choking first aid” poster not posted. “Alcohol and pregnancy” warning sign not posted. Resuscitation equipment: exhaled air resuscitation masks (adult & pediatric), latex gloves, sign not posted. Inspection report sign not posted.""",14,3.15281790283,175,Café/Coffee/Tea
04A,"Food Protection Certificate not held by supervisor of food operations.""",51,3.08954068739,145,"Juice, Smoothies, Fruit Salads"
10E,"Accurate thermometer not provided in refrigerated or hot holding equipment.""",29,3.037267501,130,Donuts
04A,"Food Protection Certificate not held by supervisor of food operations.""",43,2.9559150772,193,"Ice Cream, Gelato, Yogurt, Ices"


In [301]:
q6 = %sql select CODEDESC, VIOLATIONDESC, RATIO, COUNT from grad6b limit 20

Done.


In [302]:
lq6 =[x for x in q6]
ans6 = [ ((a, b), c, d) for a, b, c, d in lq6]
for a in ans6:
    print a

((u'Japanese', u'Food worker does not use proper utensil to eliminate bare hand contact with food that will not receive adequate additional heat treatment."'), 3.2441362822892623, 541)
((u'Juice, Smoothies, Fruit Salads', u'Food Protection Certificate not held by supervisor of food operations."'), 3.089540687389437, 145)
((u'Donuts', u'Accurate thermometer not provided in refrigerated or hot holding equipment."'), 3.0372675010032575, 130)
((u'Ice Cream, Gelato, Yogurt, Ices', u'Food Protection Certificate not held by supervisor of food operations."'), 2.955915077202543, 193)
((u'Thai', u'Thawing procedures improper."'), 2.6329639915169523, 151)
((u'Irish', u'Raw, cooked or prepared food is adulterated, contaminated, cross-contaminated, or not discarded in accordance with HACCP plan."'), 2.3692776671873292, 321)
((u'Mexican', u'Food not cooled by an approved method whereby the internal product temperature is reduced from 140\xba F to 70\xba F or less within 2 hours, and from 70\xba F to

## Rich's solution

In [None]:
%%sql

WITH uncon_prob AS
  (SELECT VIOLCODE, COUNT(*) * 1.0/ (SELECT COUNT(*) FROM events) as uncond 
   FROM events GROUP BY 1),
cuis_count AS
  (SELECT CUISINECODE AS CUISINECODE_C, COUNT(*) as counts FROM events GROUP BY 1),
viol_cuis_count AS
  (SELECT VIOLCODE, CUISINECODE, COUNT(*) as counts 
   FROM events GROUP BY VIOLCODE, CUISINECODE),
con_prob AS
  (SELECT VIOLCODE, CUISINECODE, viol_cuis_count.counts *1.0 / cuis_count.counts AS cond,
   viol_cuis_count.counts AS counts
    FROM viol_cuis_count JOIN cuis_count ON CUISINECODE = CUISINECODE_C),
unlabeled AS 
  (SELECT CUISINECODE, uncon_prob.VIOLCODE, cond / uncond AS ratio, con_prob.counts
   FROM con_prob JOIN uncon_prob ON uncon_prob.VIOLCODE = con_prob.VIOLCODE 
   WHERE counts > 100
   ORDER BY ratio DESC LIMIT 20)
SELECT CODEDESC, VIOLATIONDESC, ratio, counts 
  FROM (unlabeled JOIN violations ON unlabeled.VIOLCODE = violations.VIOLATIONCODE)
  JOIN cuisine ON unlabeled.CUISINECODE = cuisine.CUISINECODE 
  WHERE violations.ENDDATE > '2014-01-01 00:00:00';

In [None]:
result = _
formatted_result = [((x,y),z,w) for (x,y,z,w) in list(result)]

In [303]:
def violation_by_cuisine():
    return  ans6

           #[(("Café/Coffee/Tea",
           #   "Toilet facility not maintained and provided with toilet paper; "
           #   "waste receptacle and self-closing door."),
           #   1.87684775827172, 315)] * 20

grader.score('sql__violation_by_cuisine', violation_by_cuisine)

Your score:  0.95


*Copyright &copy; 2016 The Data Incubator.  All rights reserved.*